Usage Guidelines
This lesson is part of the DS Lab core curriculum. For that reason, this notebook can only be used on your WQU virtual machine.
This means:
- ⓧ No downloading this notebook.
- ⓧ No re-sharing of this notebook with friends or colleagues.
- ⓧ No downloading the embedded videos in this notebook.
- ⓧ No re-sharing embedded videos with friends or colleagues.
- ⓧ No adding this notebook to public or private repositories.
- ⓧ No uploading this notebook (or screenshots of it) to other websites, including websites for study resources.
Databases: PyMongo
Working with PyMongo¶
For all of these examples, we're going to be working with the "lagos" collection in the "air-quality" database. Before we can do anything else, we need to bring in pandas (which we won't use until the very end), pprint (a module that lets us see the data in an understandable way), and PyMongo (a library for working with MongoDB databases).
from pprint import PrettyPrinter

import pandas as pd
from pymongo import MongoClient

pp = PrettyPrinter(indent=2)

Databases
Data comes to us in lots of different ways, and one of those ways is in a database. A database is an organized collection of data.
Servers and Clients¶
Next, we need to connect to a server. A database server is where the database resides; it can be accessed using a client. Without a client, a database is just a collection of information that we can't work with, because we have no way in. We're going to be learning more about a database called MongoDB, and we'll use PrettyPrinter to make the information it generates easier to understand. Here's how the connection works:
client = MongoClient(host="localhost", port=27017)

Semi-structured Data
Databases are designed to work with either structured data or semi-structured data. In this part of the course, we're going to be working with databases that contain semi-structured data. Data is semi-structured when it has some kind of organizing logic, but that logic can't be displayed using rows and columns. Your email account contains semi-structured data if it’s divided into sections like Inbox, Sent, and Trash. If you’ve ever seen tweets from Twitter grouped by hashtag, that’s semi-structured data too. Semi-structured data is also used in sensor readings, which is what we'll be working with here.
Exploring a Database¶
So, now that we're connected to a server, let's take a look at what's there. Working our way down the specificity scale, the first thing we need to do is figure out which databases are on this server. To see which databases are on the server, we'll use the list_databases method, like this:
pp.pprint(list(client.list_databases()))

[ {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
{'empty': False, 'name': 'air-quality', 'sizeOnDisk': 7000064},
{'empty': False, 'name': 'config', 'sizeOnDisk': 98304},
{'empty': False, 'name': 'local', 'sizeOnDisk': 73728},
{'empty': False, 'name': 'wqu-abtest', 'sizeOnDisk': 585728}]
It looks like this server contains five databases: "admin", "air-quality", "config", "local", and "wqu-abtest". We're only interested in "air-quality", so let's connect to that one:
db = client["air-quality"]

In MongoDB, a database is a container for collections. Each database gets its own set of files, and a single MongoDB server typically has multiple databases.
Collections¶
Let's use a for loop to take a look at the collections in the "air-quality" database:
for c in db.list_collections():
    print(c["name"])

system.views
lagos
system.buckets.lagos
nairobi
system.buckets.nairobi
dar-es-salaam
system.buckets.dar-es-salaam
As you can see, there are three actual collections here: "nairobi", "lagos", and "dar-es-salaam". Since we're only interested in the "lagos" collection, let's get it on its own like this:
lagos = db["lagos"]

Documents
A MongoDB document is an individual record of data in a collection, and is the basic unit of analysis in MongoDB. Documents come with metadata that helps us understand what the document is; we'll get back to that in a minute. In the meantime, let's use the count_documents method to see how many documents the "lagos" collection contains:
lagos.count_documents({})

166496
Practice
Try it yourself! Bring in all the necessary libraries and modules, then connect to the "air-quality" database and print the number of documents in the "nairobi" collection.
client = MongoClient(host="localhost", port=27017)

[ {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
{'empty': False, 'name': 'air-quality', 'sizeOnDisk': 7000064},
{'empty': False, 'name': 'config', 'sizeOnDisk': 98304},
{'empty': False, 'name': 'local', 'sizeOnDisk': 73728},
{'empty': False, 'name': 'wqu-abtest', 'sizeOnDisk': 585728}]
system.views
lagos
system.buckets.lagos
nairobi
system.buckets.nairobi
dar-es-salaam
system.buckets.dar-es-salaam
202212
Retrieving Data¶
Now that we know how many documents the "lagos" collection contains, let's take a closer look at what's there. The first thing you'll notice is that the output starts out with a curly bracket ({), and ends with a curly bracket (}). That tells us that this information is a dictionary. To access documents in the collection, we'll use two methods: find and find_one. As you might expect, find will retrieve all the documents, and find_one will bring back only the first document. For now, let's stick to find_one; we'll come back to find later.
Just like everywhere else, we'll need to assign a variable name to whatever comes back, so let's call this one result.
result = lagos.find_one({})
pp.pprint(result)

{ '_id': ObjectId('6334b0f18c51459f9b1d955d'),
'metadata': { 'lat': 6.501,
'lon': 3.367,
'measurement': 'temperature',
'sensor_id': 10,
'sensor_type': 'DHT11',
'site': 4},
'temperature': nan,
'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 88000)}
Key-Value Pairs¶
There's a lot going on here! Let's work from the bottom up, starting with this:
{
'temperature': 27.0,
'timestamp': datetime.datetime(2017, 9, 6, 13, 18, 10, 120000)
}
The actual data is labeled temperature and timestamp, and if seeing it presented this way seems familiar, that's because what you're seeing at the bottom are two key-value pairs. In PyMongo, "_id" is always the primary key: the field whose values uniquely identify each document in a collection. We'll talk about that more in a minute.
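Because find_one returns a plain Python dictionary, individual values can be pulled out by key. Here's a quick sketch using the result document retrieved above (the key names match the output shown there):

print(result["timestamp"])          # when the reading was taken
print(result["metadata"]["site"])   # nested metadata values work the same way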
Metadata¶
Next, we have this:
'metadata': { 'lat': 6.602,
'lon': 3.351,
'measurement': 'temperature',
'sensor_id': 9,
'sensor_type': 'DHT11',
'site': 2}
This is the document's metadata. Metadata is data about the data. If you’re working with a database, its data is the information it contains, and its metadata describes what that information is. In MongoDB, each document often has metadata of its own. If we go back to the example of your email account, each message in your Sent folder includes both the message itself and information about when you sent it and who you sent it to; the message is data, and the other information is metadata.
The metadata we see in this block of code tells us what the key-value pairs from the last code block mean, and where the information stored there comes from. There's location data, a line telling us what kind of measurement the key-value pairs record, some information about the equipment used to gather the data, and the site the data came from.
Identifiers¶
Finally, at the top, we have this:
{
'_id': ObjectId('6126f1780e45360640bf240a')
}
This is the document's unique identifier, which is similar to the index label for each row in a pandas DataFrame.
Practice
Try it yourself! Retrieve a single document from the "nairobi" collection, and print the result.
result = nairobi.find_one({})

{ 'P1': 39.67,
'_id': ObjectId('6334b0e98c51459f9b198d27'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P1',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}
Analyzing Data¶
Now that we've seen what a document looks like in this collection, let's start working with what we've got. Since our metadata includes information about each sensor's "site", we might be curious to know how many sites are in the "lagos" collection. To do that, we'll use the distinct method, like this:
lagos.distinct("metadata.site")

[3, 4]
Notice that in order to grab the "site" number, we needed to include the "metadata" tag.
This tells us that there are 2 sensor sites in Lagos: one labeled 3 and the other labeled 4.
Let's go further. We know that there are two sensor sites in Lagos, but we don't know how many documents are associated with each site. To find that out, we'll use the count_documents method for each site.
print("Documents from site 3:", lagos.count_documents({"metadata.site": 3}))
print("Documents from site 4:", lagos.count_documents({"metadata.site": 4}))

Documents from site 3: 140586
Documents from site 4: 25910
Practice
Try it yourself! Find out how many sensor sites are in Nairobi, what their labels are, and how many documents are associated with each one.
print("Documents from site 29:", nairobi.count_documents({"metadata.site": 29}))
print("Documents from site 6:", nairobi.count_documents({"metadata.site": 6}))

Documents from site 29: 131852
Documents from site 6: 70360
Now that we know how many documents are associated with each site, let's keep drilling down and find the number of readings for each site. We'll do this with the aggregate method.
Before we run it, let's take a look at what's happening in the code. First, you'll notice several dollar signs ($). In an aggregation pipeline, a $ marks an operator like $group or $count, and a $ in front of a field name (as in "$metadata.site") refers to that field's value. Here, we're telling the collection to group the documents by their metadata.site value and then count the documents in each group.
There's also a new field: "_id". Inside a $group stage, "_id" holds the value the documents are grouped by; in this case, the site number.
Let's run the code and see what happens:
[{"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}]

[{'_id': 4, 'count': 25910}, {'_id': 3, 'count': 140586}]
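Only the pipeline itself appears above. For reference, here's a sketch of the complete call that produces those counts, using the lagos collection and pp printer from earlier in this lesson:

result = lagos.aggregate(
    [{"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}]
)
pp.pprint(list(result))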
With that information in mind, we might want to know what those readings actually are. Since we're really interested in measures of air quality, let's take a look at the P2 values in the "lagos" collection. P2 measures the amount of particulate matter in the air, which in this case is something called PM 2.5. If we wanted to get all the documents in a collection, we could, but that would result in an unmanageably large number of records clogging up the memory on our machines. Instead, let's use the find method and use limit to make sure we only get back the first 3.
result = lagos.find({"metadata.measurement": "P2"}).limit(3)

[ { 'P2': 14.42,
'_id': ObjectId('6334b0f28c51459f9b1de145'),
'metadata': { 'lat': 6.501,
'lon': 3.367,
'measurement': 'P2',
'sensor_id': 6,
'sensor_type': 'PPD42NS',
'site': 4},
'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 39000)},
{ 'P2': 19.66,
'_id': ObjectId('6334b0f28c51459f9b1de146'),
'metadata': { 'lat': 6.501,
'lon': 3.367,
'measurement': 'P2',
'sensor_id': 6,
'sensor_type': 'PPD42NS',
'site': 4},
'timestamp': datetime.datetime(2018, 1, 7, 7, 11, 23, 870000)},
{ 'P2': 24.79,
'_id': ObjectId('6334b0f28c51459f9b1de147'),
'metadata': { 'lat': 6.501,
'lon': 3.367,
'measurement': 'P2',
'sensor_id': 6,
'sensor_type': 'PPD42NS',
'site': 4},
'timestamp': datetime.datetime(2018, 1, 7, 7, 21, 53, 981000)}]
Practice
Try it yourself! Find out how many sensor sites are in Nairobi, what their labels are, how many documents are associated with each one, and the number of observations from each site. Then, return the first three documents with the value P2.
[{"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}]

[{'_id': 29, 'count': 131852}, {'_id': 6, 'count': 70360}]
[ { 'P2': 14.42,
'_id': ObjectId('6334b0f28c51459f9b1de145'),
'metadata': { 'lat': 6.501,
'lon': 3.367,
'measurement': 'P2',
'sensor_id': 6,
'sensor_type': 'PPD42NS',
'site': 4},
'timestamp': datetime.datetime(2018, 1, 7, 7, 7, 3, 39000)},
{ 'P2': 19.66,
'_id': ObjectId('6334b0f28c51459f9b1de146'),
'metadata': { 'lat': 6.501,
'lon': 3.367,
'measurement': 'P2',
'sensor_id': 6,
'sensor_type': 'PPD42NS',
'site': 4},
'timestamp': datetime.datetime(2018, 1, 7, 7, 11, 23, 870000)},
{ 'P2': 24.79,
'_id': ObjectId('6334b0f28c51459f9b1de147'),
'metadata': { 'lat': 6.501,
'lon': 3.367,
'measurement': 'P2',
'sensor_id': 6,
'sensor_type': 'PPD42NS',
'site': 4},
'timestamp': datetime.datetime(2018, 1, 7, 7, 21, 53, 981000)}]
So far, we've been dealing with relatively small subsets of the data in our collections, but what if we need to work with something bigger? Let's start by using distinct to remind ourselves of the kinds of data we have at our disposal.
lagos.distinct("metadata.measurement")

['temperature', 'P1', 'humidity', 'P2']

There are also comparison query operators that can be helpful for filtering the data:
- $eq: equal to (==)
- $gt: greater than (>)
- $lt: less than (<)
- $gte: greater than or equal to (>=)
- $lte: less than or equal to (<=)
Let's use the timestamp to see how we can use these operators to select different documents:
result = nairobi.find({"timestamp": {"$gt": datetime.datetime(2018, 9, 1)}}).limit(3)

[ { 'P1': 39.67,
'_id': ObjectId('6334b0e98c51459f9b198d27'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P1',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
{ 'P1': 39.13,
'_id': ObjectId('6334b0e98c51459f9b198d28'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P1',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
{ 'P1': 30.07,
'_id': ObjectId('6334b0e98c51459f9b198d29'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P1',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]
result = nairobi.find({"timestamp": {"$lt": datetime.datetime(2018, 12, 1)}}).limit(3)

[ { 'P1': 39.67,
'_id': ObjectId('6334b0e98c51459f9b198d27'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P1',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
{ 'P1': 39.13,
'_id': ObjectId('6334b0e98c51459f9b198d28'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P1',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
{ 'P1': 30.07,
'_id': ObjectId('6334b0e98c51459f9b198d29'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P1',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]
{"timestamp": {"$eq": datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}}

[ { 'P1': 39.67,
'_id': ObjectId('6334b0e98c51459f9b198d27'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P1',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
{ 'P2': 34.43,
'_id': ObjectId('6334b0ea8c51459f9b1a0db2'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P2',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}]
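These operators can also be combined in a single filter to select a range of timestamps. Here's a sketch (the dates are just examples, and it assumes the nairobi collection and pp printer from earlier):

result = nairobi.find(
    {
        "timestamp": {
            "$gte": datetime.datetime(2018, 9, 1),
            "$lt": datetime.datetime(2018, 10, 1),
        }
    }
).limit(3)
pp.pprint(list(result))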
Practice
Try it yourself! Find three documents with a timestamp greater than or equal to, and three with a timestamp less than or equal to, the date December 1, 2018 (datetime.datetime(2018, 12, 1, 0, 0, 6, 767000)).
result = nairobi.find({"timestamp": {"$gte": datetime.datetime(2018, 12, 1, 0, 0, 6, 767000)}}).limit(3)

[ { 'P1': 17.08,
'_id': ObjectId('6334b0e98c51459f9b19eba8'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P1',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 12, 1, 0, 0, 6, 767000)},
{ 'P1': 17.62,
'_id': ObjectId('6334b0e98c51459f9b19eba9'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P1',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 12, 1, 0, 5, 6, 327000)},
{ 'P1': 11.05,
'_id': ObjectId('6334b0e98c51459f9b19ebaa'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P1',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 12, 1, 0, 10, 5, 579000)}]
# Less than or equal to

Updating Documents
We can also update documents by passing a filter and new values, using update_one to update a single record or update_many to update many records. Let's look at an example. Before updating, the record looks like this:
{"timestamp": {"$eq": datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}}

[ { 'P1': 39.67,
'_id': ObjectId('6334b0e98c51459f9b198d27'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P1',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}]
Now let's update the sensor type from "SDS011" to "SDS". We first select all records with sensor type equal to "SDS011", then set the new value to "SDS":
result = nairobi.update_many(
    {"metadata.sensor_type": {"$eq": "SDS011"}},
    {"$set": {"metadata.sensor_type": "SDS"}},
)

Now we can see all records have changed:
{"timestamp": {"$eq": datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}}[ { 'P1': 39.67,
'_id': ObjectId('6334b0e98c51459f9b198d27'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P1',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
{ 'P2': 34.43,
'_id': ObjectId('6334b0ea8c51459f9b1a0db2'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P2',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}]
We can change it back:
result = nairobi.update_many(
    {"metadata.sensor_type": {"$eq": "SDS"}},
    {"$set": {"metadata.sensor_type": "SDS011"}},
)
result.raw_result

{'n': 0, 'nModified': 0, 'ok': 1.0, 'updatedExisting': False}

Aggregation
Next, we need to figure out which measurement type has the largest number of readings by aggregating the data in each document. Since we already know that site 3 has significantly more documents than site 4, let's start by looking at site 3. We can use the $match syntax to select only the site 3 data. The code to do that looks like this:
{"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}},

[ {'_id': 'temperature', 'count': 35147},
{'_id': 'P1', 'count': 35146},
{'_id': 'P2', 'count': 35146},
{'_id': 'humidity', 'count': 35147}]
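Only the $group stage of that pipeline survives above. As a sketch, the full call with the $match stage the text describes (assuming the same lagos collection) would look something like this:

result = lagos.aggregate(
    [
        {"$match": {"metadata.site": 3}},
        {"$group": {"_id": "$metadata.measurement", "count": {"$count": {}}}},
    ]
)
pp.pprint(list(result))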
Practice
Try it yourself! Find the number of each measurement type at site 29 in Nairobi.
pp.pprint(list(result))

After aggregation, there is another useful operator called $project, which lets you specify which fields to display by adding new fields or removing existing ones. Using the Nairobi data from site 29, we can first count each sensor type:
{"$group": {"_id": "$metadata.sensor_type", "count": {"$count": {}}}},

[{'_id': 'DHT22', 'count': 66038}, {'_id': 'SDS011', 'count': 65814}]

We can see there are two sensor types and the corresponding counts. If we only want to display the types but don't care about the counts, we can suppress the count field by setting it to 0 in $project:
{"$group": {"_id": "$metadata.sensor_type", "count": {"$count": {}}}},

[{'_id': 'DHT22'}, {'_id': 'SDS011'}]
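Again, only the $group stage appears above. A sketch of a full pipeline that suppresses the count field with $project might look like this (the $match on site 29 is an assumption based on the surrounding text):

result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {"$group": {"_id": "$metadata.sensor_type", "count": {"$count": {}}}},
        {"$project": {"count": 0}},
    ]
)
pp.pprint(list(result))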
The $project syntax is also useful for deleting intermediate fields that we used to generate our final fields but no longer need. In the following example, let's calculate the date difference for each sensor type. We'll first use the aggregate method to get the first and last dates.
"date_min": {"$min": "$timestamp"},

[ { '_id': 'DHT22',
'date_max': datetime.datetime(2018, 12, 31, 23, 57, 37, 257000),
'date_min': datetime.datetime(2018, 9, 1, 0, 0, 4, 301000)},
{ '_id': 'SDS011',
'date_max': datetime.datetime(2018, 12, 31, 23, 55, 4, 811000),
'date_min': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}]
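Only the accumulator lines of that $group stage survive in the cell above. A sketch of the full pipeline that would produce these dates (the $match on site 29 is an assumption):

result = nairobi.aggregate(
    [
        {"$match": {"metadata.site": 29}},
        {
            "$group": {
                "_id": "$metadata.sensor_type",
                "date_min": {"$min": "$timestamp"},
                "date_max": {"$max": "$timestamp"},
            }
        },
    ]
)
pp.pprint(list(result))

The $project stage with $dateDiff described next would then be appended to this same pipeline.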
Then we can calculate the date difference using $dateDiff, which takes a start date, an end date, and a unit for the timestamp data. We can see from the results above that the dates are very close to each other; the only differences are in the minutes, so we can specify the unit as minute to show the difference. Since we don't need the start and end dates themselves, we can define a "dateDiff" field inside $project so that only it is shown in the final display:
"date_min": {"$min": "$timestamp"},

[{'_id': 'DHT22', 'dateDiff': 175677}, {'_id': 'SDS011', 'dateDiff': 175675}]
If we specify unit as day, it will show the difference between the dates:
"date_min": {"$min": "$timestamp"},

Practice
Try it yourself! Find the date difference for each measurement type at site 29 in Nairobi.
pp.pprint(list(result))

We can do more with the date data using $dateTrunc, which truncates datetime data. We need to specify the datetime data, which can be a Date, a Timestamp, or an ObjectID. Then we need to specify the unit (year, month, day, hour, minute, second) and, optionally, binSize (a numerical value defining the size of each truncation bin). Let's check the example below, where we group data by month using $dateTrunc and then count how many observations there are for each month.
"date": "$timestamp",

Practice
Try it yourself! Truncate date by week and count at site 29 in Nairobi.
pp.pprint(list(result))

Finishing Up
So far, we've connected to a server, accessed that server with a client, found the collection we were looking for within a database, and explored that collection in all sorts of different ways. Now it's time to get the data we'll actually need to build a model, and store that in a way we'll be able to use.
Let's use find to retrieve the PM 2.5 data from site 3. And, since we don't need any of the metadata to build our model, let's strip that out using the projection argument. In this case, we're telling the collection that we only want to see "timestamp" and "P2". Keep in mind that we limited the number of records we'll get back to 3 when we defined result above.
# `projection` limits the kinds of data we'll get back.

{'P2': 0.01, 'timestamp': datetime.datetime(2018, 10, 1, 0, 0, 52, 906000)}
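Only the comment from that cell survives above. Here's a sketch of the query it describes; the projection keys follow the output shown, and the filter follows the site 3 / PM 2.5 description in the text:

result = lagos.find(
    {"metadata.site": 3, "metadata.measurement": "P2"},
    projection={"P2": 1, "timestamp": 1, "_id": 0},
)
pp.pprint(result.next())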
Finally, we'll use pandas to read the extracted data into a DataFrame, making sure to set timestamp as the index:
df = pd.DataFrame(result).set_index("timestamp")

| timestamp | P2 |
|---|---|
| 2018-10-01 00:04:20.554 | 0.01 |
| 2018-10-01 00:07:47.504 | 0.01 |
| 2018-10-01 00:11:14.382 | 0.01 |
| 2018-10-01 00:14:41.239 | 0.01 |
| 2018-10-01 00:18:05.938 | 0.01 |
Practice
Try it yourself! Retrieve the PM 2.5 data from site 29 in Nairobi and strip out the metadata to create a DataFrame that shows only timestamp and P2. Print the result.
result = ...

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
3.1. Wrangling Data with MongoDB 🇰🇪
from IPython.display import VimeoVideo

VimeoVideo("665412094", h="8334dfab2e", width=600)
VimeoVideo("665412135", h="dcff7ab83a", width=600)

Task 3.1.1: Instantiate a PrettyPrinter, and assign it to the variable pp.
pp = PrettyPrinter(indent=2)

Prepare Data
Connect¶
VimeoVideo("665412155", h="1ca0dd03d0", width=600)

Task 3.1.2: Create a client that connects to the database running at localhost on port 27017.
client = MongoClient(host="localhost", port=27017)

Explore
VimeoVideo("665412176", h="6fea7c6346", width=600)

Task 3.1.3: Print a list of the databases available on client.
pp.pprint(list(client.list_databases()))

[ {'empty': False, 'name': 'admin', 'sizeOnDisk': 40960},
{'empty': False, 'name': 'air-quality', 'sizeOnDisk': 7000064},
{'empty': False, 'name': 'config', 'sizeOnDisk': 110592},
{'empty': False, 'name': 'local', 'sizeOnDisk': 73728},
{'empty': False, 'name': 'wqu-abtest', 'sizeOnDisk': 585728}]
VimeoVideo("665412216", h="7d4027dc33", width=600)

Task 3.1.4: Assign the "air-quality" database to the variable db.
db = client["air-quality"]

VimeoVideo("665412231", h="89c546b00f", width=600)

Task 3.1.5: Use the list_collections method to print a list of the collections available in db.
for c in db.list_collections():
    print(c["name"])

system.views
lagos
system.buckets.lagos
nairobi
system.buckets.nairobi
dar-es-salaam
system.buckets.dar-es-salaam

VimeoVideo("665412252", h="bff2abbdc0", width=600)

Task 3.1.6: Assign the "nairobi" collection in db to the variable name nairobi.
nairobi = db["nairobi"]

VimeoVideo("665412270", h="e4a5f5c84b", width=600)

Task 3.1.7: Use the count_documents method to see how many documents are in the nairobi collection.
nairobi.count_documents({})

202212

VimeoVideo("665412279", h="c2315f3be1", width=600)

Task 3.1.8: Use the find_one method to retrieve one document from the nairobi collection, and assign it to the variable name result.
result = nairobi.find_one({})

{ 'P1': 39.67,
'_id': ObjectId('6334b0e98c51459f9b198d27'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P1',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)}
VimeoVideo("665412306", h="e1e913dfd1", width=600)

Task 3.1.9: Use the distinct method to determine how many sensor sites are included in the nairobi collection.
nairobi.distinct("metadata.site")

[6, 29]

VimeoVideo("665412322", h="4776c6d548", width=600)

Task 3.1.10: Use the count_documents method to determine how many readings there are for each site in the nairobi collection.
- Count the documents in a collection using PyMongo.
print("Documents from site 6:", nairobi.count_documents({"metadata.site": 6}))
print("Documents from site 29:", nairobi.count_documents({"metadata.site": 29}))

Documents from site 6: 70360
Documents from site 29: 131852

VimeoVideo("665412344", h="d2354584cd", width=600)

Task 3.1.11: Use the aggregate method to determine how many readings there are for each site in the nairobi collection.
{"$group":{"_id":"$metadata.site","count":{"$count":{}}}}

[{'_id': 6, 'count': 70360}, {'_id': 29, 'count': 131852}]

VimeoVideo("665412372", h="565122c9cc", width=600)

Task 3.1.12: Use the distinct method to determine how many types of measurements have been taken in the nairobi collection.
nairobi.distinct("metadata.measurement")

['P2', 'humidity', 'P1', 'temperature']

VimeoVideo("665412380", h="f7f7a39bb3", width=600)

Task 3.1.13: Use the find method to retrieve the PM 2.5 readings from all sites. Be sure to limit your results to 3 records only.
result = nairobi.find({"metadata.measurement": "P2"}).limit(3)

[ { 'P2': 34.43,
'_id': ObjectId('6334b0ea8c51459f9b1a0db2'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P2',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 0, 2, 472000)},
{ 'P2': 30.53,
'_id': ObjectId('6334b0ea8c51459f9b1a0db3'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P2',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 5, 3, 941000)},
{ 'P2': 22.8,
'_id': ObjectId('6334b0ea8c51459f9b1a0db4'),
'metadata': { 'lat': -1.3,
'lon': 36.785,
'measurement': 'P2',
'sensor_id': 57,
'sensor_type': 'SDS011',
'site': 29},
'timestamp': datetime.datetime(2018, 9, 1, 0, 10, 4, 374000)}]
VimeoVideo("665412389", h="8976ea3090", width=600)

Task 3.1.14: Use the aggregate method to calculate how many readings there are for each type ("humidity", "temperature", "P2", and "P1") in site 6.
{"$group":{"_id":"$metadata.measurement","count":{"$count":{}}}}

[ {'_id': 'P2', 'count': 18169},
{'_id': 'humidity', 'count': 17011},
{'_id': 'P1', 'count': 18169},
{'_id': 'temperature', 'count': 17011}]
VimeoVideo("665412418", h="0c4b125254", width=600)

Task 3.1.15: Use the aggregate method to calculate how many readings there are for each type ("humidity", "temperature", "P2", and "P1") in site 29.
{"$group":{"_id":"$metadata.measurement","count":{"$count":{}}}}

[ {'_id': 'P2', 'count': 32907},
{'_id': 'humidity', 'count': 33019},
{'_id': 'P1', 'count': 32907},
{'_id': 'temperature', 'count': 33019}]
Import¶
VimeoVideo("665412437", h="7a436c7e7e", width=600)

Task 3.1.16: Use the find method to retrieve the PM 2.5 readings from site 29. Be sure to limit your results to 3 records only. Since we won't need the metadata for our model, use the projection argument to limit the results to the "P2" and "timestamp" keys only.
{"metadata.site":29,"metadata.measurement":"P2"},

VimeoVideo("665412442", h="494636d1ea", width=600)

Task 3.1.17: Read records from your result into the DataFrame df. Be sure to set the index to "timestamp".
df = pd.DataFrame(result).set_index("timestamp")

| timestamp | P2 |
|---|---|
| 2018-09-01 00:00:02.472 | 34.43 |
| 2018-09-01 00:05:03.941 | 30.53 |
| 2018-09-01 00:10:04.374 | 22.80 |
| 2018-09-01 00:15:04.245 | 13.30 |
| 2018-09-01 00:20:04.869 | 16.57 |
assert isinstance(df.index, pd.DatetimeIndex), "`df` should have a `DatetimeIndex`."

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
Databases: SQL
from IPython.display import YouTubeVideo

Working with SQL Databases
A database is a collection of interrelated data. The primary goal of a database is to store and retrieve information in a convenient and efficient way. There are many types of databases. In this section, we will be dealing with a relational database. A relational database is a widely used database model that consists of a collection of uniquely named tables used to store information. The structure of a database model with its tables, constraints, and relationships is called a schema.
Structured Query Language (SQL) is used to retrieve information from a relational database. SQL is one of the most commonly used database languages. It allows data stored in a relational database to be queried, modified, and manipulated easily with basic commands. SQL powers database engines like MySQL, SQL Server, SQLite, and PostgreSQL. The examples and projects in this course will use SQLite.
A table refers to a collection of rows and columns in a relational database. When reading data into a pandas DataFrame, an index can be defined, which acts as the label for every row in the DataFrame.
Connecting to a Database¶
ipython-sql¶
Magic Commands¶
Jupyter notebooks can run special commands that are not valid Python code but still affect the notebook's behavior. These are called magic commands, and they can have a range of properties. Some commonly used magic commands are below:
| Magic Command | Description of Command |
|---|---|
| %pwd | Print the current working directory |
| %cd | Change the current working directory |
| %ls | List the contents of the current directory |
| %history | Show the history of the In [ ]: commands |
We will be leveraging magic commands to work with a SQLite database.
ipython-sql¶
ipython-sql allows you to write SQL code directly in a Jupyter Notebook. The %sql (or %%sql) magic command is added to the beginning of a code block and then SQL code can be written.
Connecting with ipython-sql¶
We can connect to a database using the %sql magic function:
%sql sqlite:////home/jovyan/nepal.sqlite

sqlite3
We can also connect to the same database using the sqlite3 package:
import sqlite3

conn = sqlite3.connect("/home/jovyan/nepal.sqlite")

Querying a Database
Building Blocks of the Basic Query¶
There are seven common clauses used for querying data:
| Clause Name | Definition |
|---|---|
| SELECT | Determines which columns to include in the query's result |
| FROM | Identifies the table from which to query the data |
| WHERE | Filters data |
| GROUP BY | Groups rows by common values in columns |
| HAVING | Filters out unwanted groups from GROUP BY |
| ORDER BY | Orders the rows using one or more columns |
| LIMIT | Outputs the specified number of rows |
All clauses may be used together, but SELECT and FROM are the only required clauses. The format of clauses is in the example query below:
SELECT column1, column2
FROM table_name
WHERE "conditions"
GROUP BY "column-list"
HAVING "conditions"
ORDER BY "column-list"
SELECT and FROM¶
You can use SELECT * to select all columns in a table. FROM specifies the table in the database to query. LIMIT 5 will select only the first five rows.
Example
SELECT *
FROM id_map
LIMIT 5

You can also use SELECT to select certain columns in a table:
SELECT household_id,

Practice
Try it yourself! Use SELECT to select the district_id column from the id_map table.
%%sql

We can also assign an alias or temporary name to a column using the AS command. Aliases can also be used on a table. See the example below, which assigns the alias household_number to household_id:
SELECT household_id AS household_number,

Practice
Try it yourself! Use SELECT, FROM, AS, and LIMIT to select the first 5 rows from the id_map table. Rename the district_id column to district_number.
%%sql

Filtering and Sorting Data
SQL provides a variety of comparison operators that can be used with the WHERE clause to filter the data.
| Comparison Operator | Description |
|---|---|
| = | Equal |
| > | Greater than |
| < | Less than |
| >= | Greater than or equal to |
| <= | Less than or equal to |
| <> or != | Not equal to |
| LIKE | String comparison test |
For example, to select the first 5 homes in Ramechhap (district 2):
%%sql
SELECT *
FROM id_map
WHERE district_id = 2
LIMIT 5

Practice
Try it yourself! Use WHERE to select the row with household_id equal to 13735001
%%sql

Aggregating Data
Aggregation functions take a collection of values as inputs and return one value as the output. The table below gives the frequently used built-in aggregation functions:
| Aggregation Function | Definition |
|---|---|
| MIN | Return the minimum value |
| MAX | Return the largest value |
| SUM | Return the sum of values |
| AVG | Return the average of values |
| COUNT | Return the number of observations |
Use the COUNT function to find the number of observations in the id_map table that come from Ramechhap (district 2):
WHERE district_id = 2

Aggregation functions are frequently used with a GROUP BY clause to perform the aggregation on groups of data. For example, the query below returns the count of observations in each district:
GROUP BY district_id

DISTINCT is a keyword to select unique records in a query result. For example, if we want to know the unique values in the district_id column:
SELECT distinct(district_id)Practice
Try it yourself! Use DISTINCT to count the number of unique values in the vdcmun_id column.
%%sql

DISTINCT and COUNT can be used in combination to count the number of distinct records. For example, if we want to know the number of unique values in the district_id column:
SELECT count(distinct(district_id))

Practice
Try it yourself! Use DISTINCT and COUNT to count the number of unique values in the vdcmun_id column.
%%sql

Joining Tables
Joins link data from two or more tables together by using a column that is common between the two tables. The basic syntax for a join is below, where table1 and table2 refer to the two tables being joined, column1 and column2 refer to columns to be returned from both tables, and ID refers to the common column in the two tables.
SELECT table1.column1,
table2.column2
FROM table1
JOIN table2 ON table1.ID = table2.ID
We'll explore the concept of joins by first identifying a single household that we'd like to pull in building information for. For example, let's say we want to see the corresponding foundation_type for the first home in Ramechhap (district 2). We'll start by looking at this single record in the id_map table.
WHERE district_id = 2

This household has building_id equal to 23. Let's look at the foundation_type for this building by filtering the building_structure table to find this building.
FROM building_structure

To join the two tables and limit the results to building_id = 23:
JOIN building_structure ON id_map.building_id = building_structure.building_id

In addition to the basic JOIN clause, specific join types can be specified, which determine whether the common column needs to be in one, both, or either of the two tables being joined. The different join types are below. The left table is the table specified first, immediately after the FROM clause, and the right table is the table specified after the JOIN clause. If the generic JOIN clause is used, then the INNER JOIN is used by default.
| JOIN Type | Definition |
|---|---|
| INNER JOIN | Returns rows where ID is in both tables |
| LEFT JOIN | Returns rows where ID is in the left table; returns NA for column values if ID is not in the right table |
| RIGHT JOIN | Returns rows where ID is in the right table; returns NA for column values if ID is not in the left table |
| FULL JOIN | Returns rows where ID is in either table; returns NA for column values if ID is not in both tables |
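As a concrete sketch of a join, written with sqlite3 and pandas (which this lesson introduces below); the table and column names follow the nepal.sqlite examples above:

import sqlite3

import pandas as pd

conn = sqlite3.connect("/home/jovyan/nepal.sqlite")
query = """
SELECT id_map.household_id,
       building_structure.foundation_type
FROM id_map
LEFT JOIN building_structure ON id_map.building_id = building_structure.building_id
LIMIT 5
"""
df = pd.read_sql(query, conn)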
The video below outlines the main types of joins:
YouTubeVideo("2HVMiPPuPIM")

Practice
Try it yourself! Use the DISTINCT command to create a column with all unique building IDs in the id_map table. LEFT JOIN this column with the roof_type column from the building_structure table, showing only buildings where district_id is 1 and limiting your results to the first five rows of the new table.
%%sql

Using pandas with SQL Databases
To save the output of a query into a pandas DataFrame, we will use connect to the SQLite database using the SQLite3 package:
conn = sqlite3.connect("/home/jovyan/nepal.sqlite")

To run a query using sqlite3, we need to store the query as a string. For example, the variable below called query is a string containing a query which returns the first 10 rows from the id_map table:
query = """
SELECT *
FROM id_map
LIMIT 10
"""

To save the results of the query into a pandas DataFrame, use the pd.read_sql() function. The optional parameter index_col can be used to set the index to a specific column from the query.
df = pd.read_sql(query, conn, index_col="building_id")

Practice
Try it yourself! Use the pd.read_sql function to save the results of a query to a DataFrame. The query should select first 20 rows from the id_map table.
query = ...

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
3.2. Linear Regression with Time Series Data
from sklearn.linear_model import LinearRegression

VimeoVideo("665412117", h="c39a50bd58", width=600)

Prepare Data
Import¶
VimeoVideo("665412469", h="135f32c7da", width=600)

Task 3.2.1: Complete the code to create a client that connects to the MongoDB server, assign the "air-quality" database to db, and assign the "nairobi" collection to nairobi.
client = MongoClient(host="localhost", port=27017)

VimeoVideo("665412480", h="c20ed3e570", width=600)

Task 3.2.2: Complete the wrangle function below so that the results from the database query are read into the DataFrame df. Be sure that the index of df is the "timestamp" from the results.
df.index = df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")

VimeoVideo("665412496", h="d757475f7c", width=600)

Task 3.2.3: Use your wrangle function to read the data from the nairobi collection into the DataFrame df.
df = wrangle(nairobi)

| timestamp | P2 | P2.L1 |
|---|---|---|
| 2018-09-01 04:00:00+03:00 | 15.800000 | 17.541667 |
| 2018-09-01 05:00:00+03:00 | 11.420000 | 15.800000 |
| 2018-09-01 06:00:00+03:00 | 11.614167 | 11.420000 |
| 2018-09-01 07:00:00+03:00 | 17.665000 | 11.614167 |
| 2018-09-01 08:00:00+03:00 | 21.016667 | 17.665000 |
| 2018-09-01 09:00:00+03:00 | 22.589167 | 21.016667 |
| 2018-09-01 10:00:00+03:00 | 18.605833 | 22.589167 |
| 2018-09-01 11:00:00+03:00 | 14.022500 | 18.605833 |
| 2018-09-01 12:00:00+03:00 | 13.150000 | 14.022500 |
| 2018-09-01 13:00:00+03:00 | 12.806667 | 13.150000 |
assert any([isinstance(df, pd.DataFrame), isinstance(df, pd.Series)])

VimeoVideo("665412520", h="e03eefff07", width=600)

Task 3.2.4: Add to your wrangle function so that the DatetimeIndex for df is localized to the correct timezone, "Africa/Nairobi". Don't forget to re-run all the cells above after you change the function.
# df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")
assert df.index.tzinfo == pytz.timezone("Africa/Nairobi")

Explore
VimeoVideo("665412546", h="97792cb982", width=600)

Task 3.2.5: Create a boxplot of the "P2" readings in df.
df["P2"].plot(kind="box", vert=False, ax=ax)

<AxesSubplot:>

VimeoVideo("665412573", h="b46049021b", width=600)

Task 3.2.6: Add to your wrangle function so that all "P2" readings above 500 are dropped from the dataset. Don't forget to re-run all the cells above after you change the function.
assert len(df) <= 32906

VimeoVideo("665412594", h="e56c2f6839", width=600)

Task 3.2.7: Create a time series plot of the "P2" readings in df.
fig, ax = plt.subplots(figsize=(15, 6))

VimeoVideo("665412601", h="a16c5a73fc", width=600)

Task 3.2.8: Add to your wrangle function to resample df to provide the mean "P2" reading for each hour. Use a forward fill to impute any missing values. Don't forget to re-run all the cells above after you change the function.
# df["P2"].resample("1H").mean().fillna(method="ffill").to_frame() #.isnull().sum()
assert len(df) <= 2928

VimeoVideo("665412649", h="d2e99d2e75", width=600)

Task 3.2.9: Plot the rolling average of the "P2" readings in df. Use a window size of 168 (the number of hours in a week).
df["P2"].rolling(168).mean().plot(ax=ax);

VimeoVideo("665412693", h="c3bca16aff", width=600)

Task 3.2.10: Add to your wrangle function to create a column called "P2.L1" that contains the mean "P2" reading from the previous hour. Since this new feature will create NaN values in your DataFrame, be sure to also drop null rows from df.
assert len(df) <= 11686

VimeoVideo("665412732", h="059e4088c5", width=600)

Task 3.2.11: Create a correlation matrix for df.
df.corr()

| | P2 | P2.L1 |
|---|---|---|
| P2 | 1.000000 | 0.650679 |
| P2.L1 | 0.650679 | 1.000000 |
VimeoVideo("665412741", h="7439cb107c", width=600)

Task 3.2.12: Create a scatter plot that shows the PM 2.5 mean reading for each hour as a function of the mean reading from the previous hour. In other words, "P2.L1" should be on the x-axis, and "P2" should be on the y-axis. Don't forget to label your axes!
ax.plot([0, 120], [0, 120], linestyle="--", color="orange")

[<matplotlib.lines.Line2D at 0x7f7eae8ce940>]
Split¶
VimeoVideo("665412762", h="a5eba496f7", width=600)

Task 3.2.13: Split the DataFrame df into the feature matrix X and the target vector y. Your target is "P2".
feature = ["P2.L1"]

| timestamp | P2.L1 |
|---|---|
| 2018-09-01 04:00:00+03:00 | 17.541667 |
| 2018-09-01 05:00:00+03:00 | 15.800000 |
| 2018-09-01 06:00:00+03:00 | 11.420000 |
| 2018-09-01 07:00:00+03:00 | 11.614167 |
| 2018-09-01 08:00:00+03:00 | 17.665000 |
VimeoVideo("665412785", h="03118eda71", width=600)

Task 3.2.14: Split X and y into training and test sets. The first 80% of the data should be in your training set. The remaining 20% should be in the test set.
X_train, y_train = X.iloc[:cutoff], y.iloc[:cutoff]

len(X_train) + len(X_test) == len(X)

True
Build Model¶
Baseline¶
Task 3.2.15: Calculate the baseline mean absolute error for your model.
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)

Mean P2 Reading: 9.27
Baseline MAE: 3.89
Iterate¶
Task 3.2.16: Instantiate a LinearRegression model named model, and fit it to your training data.
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()
Evaluate¶
VimeoVideo("665412844", h="129865775d", width=600)Task 3.2.17: Calculate the training and test mean absolute error for your model.
training_mae = mean_absolute_error(y_train, model.predict(X_train))

Training MAE: 2.46
Test MAE: 1.8
Communicate Results¶
Task 3.2.18: Extract the intercept and coefficient from your model.
- Access an object in a pipeline in scikit-learn
print(f"P2 = {intercept} + ({coefficient} * P2.L1)")

P2 = 3.36 + (0.64 * P2.L1)

VimeoVideo("665412870", h="318d69683e", width=600)

Task 3.2.19: Create a DataFrame df_pred_test that has two columns: "y_test" and "y_pred". The first should contain the true values for your test set, and the second should contain your model's predictions. Be sure the index of df_pred_test matches the index of y_test.
df_pred_test = pd.DataFrame({"y_test": y_test, "y_pred": model.predict(X_test)})

| timestamp | y_test | y_pred |
|---|---|---|
| 2018-12-07 17:00:00+03:00 | 7.070000 | 8.478927 |
| 2018-12-07 18:00:00+03:00 | 8.968333 | 7.865485 |
| 2018-12-07 19:00:00+03:00 | 11.630833 | 9.076421 |
| 2018-12-07 20:00:00+03:00 | 11.525833 | 10.774814 |
| 2018-12-07 21:00:00+03:00 | 9.533333 | 10.707836 |
VimeoVideo("665412891", h="39d7356a26", width=600)

Task 3.2.20: Create a time series line plot for the values in df_pred_test using plotly express. Be sure that the y-axis is properly labeled as "P2".
fig = px.line(df_pred_test)

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
Time Series: Statistical Models
Autoregression¶
Autoregression (AR) is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. AR works similarly to autocorrelation: in both cases, we're comparing data from one part of a series to another part of the same series. In other words, an AR model regresses the series on its own past values.
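Written out, an AR model of order $p$ predicts each value from the previous $p$ values of the same series (a sketch of the standard definition, not taken from the lesson):

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t$$

Each $\phi$ is a coefficient learned from the data, $c$ is a constant, and $\epsilon_t$ is noise. The model we fit later in this lesson uses lags=24, so it predicts each hourly reading from the previous 24 hourly readings.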
Cleaning the Data¶
Just like with linear regression, we'll start by bringing in some tools to help us along the way.
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

Since we'll be working with the "air-quality" data again, we need to connect to the server, start our client, and grab the data we need.
client = MongoClient(host="localhost", port=27017)

Practice
Just to make sure we're all on the same page, import all those libraries and get your database up and running. Remember that even though all the examples use the Site 3 data from the lagos collection, the practice sets should use the Site 4 data from the lagos collection. Call your collection lagos_prac.
lagos_prac = ...

In order to get our data into a form we can use to build our model, we're going to need to transform it in several key ways. The first thing we need to do is to get the data we need, and save the results in a DataFrame. Since we're interested in predicting the changes in air quality over time, let's set the DataFrame's index to "timestamp":
{"metadata.site": 3, "metadata.measurement": "P2"},

Practice
Try it yourself! Create a list called results_prac that pulls data from Site 4 in the lagos data, then save it in a DataFrame called df_prac with the index "timestamp".
Localizing the Timezone¶
Because MongoDB stores all timestamps in UTC, we need to figure out a way to localize it. Having timestamps in UTC might be useful if we were trying to predict some kind of global trend, but since we're only interested in what's happening with the air in Lagos, we need to change the data from UTC to Africa/Lagos. Happily, pandas has a pair of tools to help us out: tz_localize and tz_convert. We use those methods to transform our data like this:
df.index = df.index.tz_localize("UTC").tz_convert("Africa/Lagos")

Resampling Data
The most important kind of data in our time-series model is the data that deals with time. Our "timestamp" data tells us when each reading was taken, but in order to create a good predictive model, we need the readings to happen at regular intervals. Our data doesn't do that, so we need to figure out a way to change it so that it does. The resample method does that for us.
Let's resample our data to create 1-hour reading intervals by aggregating using the mean:
df = df["P2"].resample("1H").mean().fillna(method="ffill").to_frame()

Notice the second half of the code:
fillna(method="ffill").to_frame()
That tells the model to forward-fill any empty cells with imputed data. Forward-filling means that each missing value is filled with the most recent earlier reading that actually has data in it. This helps to keep the imputed data in line with the rest of the dataset.
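Here's a tiny sketch of what forward-filling does to a short series (the numbers are made up, just for illustration):

import pandas as pd

s = pd.Series([27.0, None, None, 28.5])
print(s.fillna(method="ffill"))  # the two missing readings both become 27.0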
Adding a Lag¶
We've spent some time elsewhere thinking about how two sets of data — apartment price and location, for example — compare to each other, but we haven't had any reason to consider how a dataset might compare to itself. If we're predicting the future, we want to know how good our prediction will be, so it might be useful to build some of that accountability into our model. To do that, we need to add a lag.
Lagging data means that we're adding a delay. In this case, we're going to allow the model to test itself out by comparing its predictions with what actually happened an hour before. If the prediction and the reality are close, then it's a good model; if they aren't, then the model isn't a very good one.
So, let's add a one-hour lag to our dataset:
# In `shift(1)`, the `1` is the lagged interval.
df["P2.L1"] = df["P2"].shift(1)

Finally, let's drop our null values:
y = df["P2"].resample("1H").mean().fillna(method="ffill")

Practice
Try it yourself! Clean the Site 4 data from lagos, and save it as a Series called y_prac.
df_prac["P2.L1"] = ...

Exploring the Data
Time Series Line Plot¶
Example of a time series line plot.
Creating an ACF Plot¶
Let's make an ACF plot using our y Series.
fig1, ax = plt.subplots(figsize=(15, 6))
plot_acf(y, ax=ax)

Each of the dots on our plot represents a correlation coefficient. The first data point in the top left of the graph tells us that at time-step 0, the correlation coefficient was 1, meaning that there was a perfect correlation. That makes sense, because you can't lag from time-step 0, so the coefficient can't be anything other than 1. But, starting at hour 1, the coefficient drops precipitously, and we see our autocorrelation coefficients slowly decay over time. As our lag recedes further into the past, the correlations break down; a prediction you made five hours ago about what's happening right now is going to be a lot more reliable than a prediction you made 96 hours ago.
The light blue shape across the bottom of the graph represents the confidence interval, or the extent to which we can be sure that our estimated correlations reflect the correlations that exist in reality. By default, this is set to 95%. Data points which fall either above or below the shape are likely not due to chance, and those which fall inside the shape are likely due to chance. It looks like all our data is the result of some kind of effect, so we're good to go.
Practice
Try it yourself! Make an ACF plot called fig2 using your y_prac Series.
fig2, ax = ...

Creating a PACF Plot
Let's make a PACF plot using our y Series.
fig1, ax = plt.subplots(figsize=(15, 6))
plot_pacf(y, ax=ax)

Aha! This looks very different. There are two things to notice here:
First, we now have lots of data points that we can be relatively certain aren't due to chance, but we also have lots of data points inside the blue shape at the bottom, indicating that some of our data points are indeed due to chance. That's not necessarily a problem, but it's something useful to keep in mind.
Second, recognize that even though the amplitude of the points on our graph has been significantly reduced, the trend has remained essentially the same: Strong positive correlations at the beginning, with the effect decaying over time. We would expect to see that, because the farther out into the future our predictions go, the less accurate they become.
Practice
Try it yourself! Make an PACF plot using your y_prac Series.
fig2, ax = ...

Working with Rolling Windows
Rolling window is an important concept for time series analysis. We first define a window size, like 7 days, three months, etc. Then we calculate some statistics taking data from each window sequentially throughout the time series. For example, if I want to calculate a three-month rolling sum with the time series data below:
| Month | sales |
|---|---|
| 2022-01 | 10 |
| 2022-02 | 20 |
| 2022-03 | 25 |
| 2022-04 | 15 |
| 2022-05 | 20 |
| 2022-06 | 30 |
The three-month rolling sum would be
| Rolling Months | Rolling sum sales |
|---|---|
| 2022-01,02,03 | 55 |
| 2022-02,03,04 | 60 |
| 2022-03,04,05 | 60 |
| 2022-04,05,06 | 65 |
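Here's a quick sketch that reproduces the rolling sum above with pandas (same sales numbers, made-up monthly index):

import pandas as pd

sales = pd.Series(
    [10, 20, 25, 15, 20, 30],
    index=pd.period_range("2022-01", periods=6, freq="M"),
)
print(sales.rolling(3).sum().dropna())  # 55, 60, 60, 65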
Rolling window statistics are very helpful in smoothing noisy data when making time series predictions. Let's see it with an example. Since we're interested in making predictions about the air quality in Lagos, it would be helpful to understand the rolling average for the PM 2.5 readings with a line plot. To keep things manageable, we'll set our window-size to one week.
# `168` is the number of hours in a week.

Even though there are lots of peaks and valleys here, we're starting to see an emerging trend.
We can make the same graph using pandas, like this:
Practice
Try it yourself! Make a line plot that shows the weekly rolling average of the P2 values in the Site 4 dataset.
Besides rolling sums and rolling averages, rolling windows can be used with many other statistics, depending on the problem you're facing. For example, when using a GARCH model to analyze stock prices, a rolling window can be used to calculate the standard deviation.
Splitting the Data in pandas¶
The last thing to do in our data exploration is to split our data into training and test sets. For linear regression, we used an 80/20 split, where we used 80% of the data was our training set, and 20% of it was our test set. This time, we're going to expand the test set to 95%, and decrease the test set to %5 to bring it into line with statsmodels default confidence interval. This is important, because we'll need to use as much training data as we can if our model is going to accurately predict what's going to come next.
cutoff_test1 = int(len(y) * 0.95)Practice
Try it yourself! Create a cutoff called cutoff_test2, split the y_pracSeries into training and test sets, making sure to set the cutoff to 0.95.
cutoff_test2 = ...
Building the Model¶
Baseline¶
First, let's calculate the baseline MAE for our model.
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
Practice
Try it yourself! Calculate the baseline mean and MAE for the y_prac Series.
y_prac_pred_baseline = ...
Iterating¶
Before we can go any further, we need to instantiate an autoregression model based on our y training data. We'll call the model model.
model = AutoReg(y_train, lags=24, old_names=False).fit()
Notice that, unlike our linear regression model which we built using scikit-learn, we're combining instantiation and fitting into one step; statsmodels includes that ability in its AutoReg method.
Practice
Try it yourself! Create and fit an autoregression model called model_prac.
model_prac = ...
Autoregression models need us to generate in-sample predictions in order to calculate the MAE of our training data. In-sample predictions use data that's already part of our sample. That's to distinguish it from out-of-sample predictions, which we'll talk about a bit later. The statsmodels library includes a method called predict that can help us here. Above, the AutoReg method includes this line:
old_names=False
The old_names=False argument doesn't change how the model makes predictions; it just tells statsmodels to use its newer parameter-naming convention instead of the deprecated one, which keeps a deprecation warning from cluttering the output.
Here's how to generate in-sample predictions:
y_pred = model.predict().dropna()
Once we've done that, we can calculate the MAE of the predictions in our training set.
training_mae = mean_absolute_error(y_train.loc[y_pred.index], y_pred)
Practice
Try it yourself! Generate in-sample predictions using y_prac, and find the MAE for your y_prac training data. Print the result.
y_prac_train.loc[y_prac_pred.index], y_prac_pred
Residuals¶
We're going to use our model's residuals to make some visualizations, but first, we need to calculate what those residuals are.
y_train_resid = y_train - y_pred
Now we can make a line plot of our model's residuals.
fig1, ax = plt.subplots(figsize=(15, 6))
The ideal residual plot has a random set of data points spread evenly on both sides of the line. The plot we just made actually looks pretty good; there are some significant outliers, but, on the whole, the bars describe an even band of values, which is what we're looking for.
Practice
Try it yourself! Calculate the residuals for y_prac and visualize them on a line plot called fig2.
y_prac_train_resid = ...
Let's also take a look at a histogram of the residuals to help us see how they're distributed.
y_train_resid.hist();
Remember, when we make histograms, we're trying to answer two questions:
1.) Is it a normal distribution? 2.) Are there any outliers?
For our histogram, that middle bar is pretty tall, but the shape described by all the bars looks like a normal distribution (albeit a stretched one), so the answer to the first question is "yes." Outliers are values that fall beyond the shape of a normal distribution, and it doesn't look like we have any of those, so the answer to the second question is "no." Those are the answers we're looking for, so let's move on to the next step.
ACF Plots¶
We're going to make an ACF plot of the residuals to see whether there's any correlation left in them that the model hasn't captured.
plot_acf(y_train_resid.dropna(), ax=ax);
At first, this might seem wrong, but we're actually looking for a mostly-flat graph here. This is an indication that our model describes all the seasonality, or regular changes, in our data. In other words, this graph is exactly what we're looking for.
Practice
Try it yourself! Calculate the residuals for y_prac, then make a histogram and an ACF plot of them.
Evaluating the Model¶
Now that we've built an autoregression model that seems to be working pretty well, it's time to evaluate it. We've already established that the model works well when compared to itself, but what about how well it works when we start looking outside our original dataset?
y_pred_test = model.predict(y_test.index.min(), y_test.index.max())
Now that we have a prediction, we can calculate the MAE of our out-of-sample data.
test_mae = mean_absolute_error(y_test, y_pred_test)
Practice
Try it yourself! Generate out-of-sample predictions using your y_prac data and model_prac, calculate the MAE, and print the result.
y_prac_test.index.min(), y_prac_test.index.max()
Now that we have some out-of-sample predictions, we can compare them to the true test values using a line plot. The first step there is to create a new DataFrame called test1_predictions with two columns: one for the y_test data (the true data) and one for the y_pred (the predicted data). It's always a good idea to print the first five rows of a new DataFrame, just to make sure it looks the way it should.
{"y_test": y_test, "y_pred": y_pred_test}, index=y_test.indexThat looks correct, so we can move on to our line plot.
fig = px.line(test1_predictions, labels={"value": "P2"})
This looks kind of strange, but it's actually exactly what we would expect to see. At the beginning, the y_pred data has a fair amount of predictive power, but, as time goes on, the predictions become less and less accurate. It's kind of like what happened with our ACF plots, only in reverse. Last time, the model lost its predictive power as the lag increased. Here, the model loses its predictive power as the horizon — how far away from the present your predictions are — increases. But don't worry! We'll fix it in a second.
Practice
In the meantime, try it yourself! Make a DataFrame with columns for y_prac_test and y_prac_pred, and print the result. Then, make a line plot that shows the relationship between the two variables.
Walk-forward Validation¶
Our predictions lose power over time because the model gets farther and farther away from its beginning. But what if we could move that beginning forward with the model? That's what walk-forward validation does. In a walk-forward validation, we re-train the model for each new observation in the dataset, dropping the data that's the farthest in the past. Let's say that our prediction for what's going to happen at 12:00 is based on what happened at 11:00, 10:00, and 9:00. When we move forward an hour to predict what's going to happen at 1:00, we only use data from 10:00, 11:00, and 12:00, dropping the data from 9:00 because it's now too far in the past. Let's see how it works.
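Here's a rough sketch of the loop, assuming the y_train and y_test Series from our split above. Note that this version simply keeps growing the training history by one observation each step; dropping the oldest observation at the same time would turn it into a fixed-size rolling window.
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

y_pred_wfv = pd.Series(dtype="float64")
history = y_train.copy()
for _ in range(len(y_test)):
    # Re-fit the model on everything observed so far
    model = AutoReg(history, lags=24, old_names=False).fit()
    # Forecast one step ahead
    next_pred = model.forecast()
    y_pred_wfv = pd.concat([y_pred_wfv, next_pred])
    # Add the true observation to the history before moving to the next hour
    history = pd.concat([history, y_test[next_pred.index]])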
# Then, we define a variable that takes into account what's happened in the past
You'll notice that we're using the same AutoReg method we used before, only this time, we're using the y_train data. Also like before, the 24 is telling the model how many hours it should pay attention to. If you change that number, the MAE will change too.
Speaking of the MAE, let's calculate it and see what we've got.
print("Test MAE 1 (walk forward validation):", round(test1_mae, 2))Practice
Try it yourself! Perform a walk-forward validation of your model using the y_prac_train data. Then, calculate the MAE and print the result. Note that because we're using %%capture in the validation cell, you'll need to create a new cell for your MAE calculation.
y_prac_pred_wfv = ...
test2_mae = ...
Communicating the Results¶
In machine learning, the model's parameters are the parts of the model that are learned from the training data. There are also hyperparameters, which we'll discuss in the next module. For now, just know that parameters come from inside the model, and hyperparameters are specified outside the model.
So, let's print the parameters of our validated model and see what it looks like.
print(model.params)
That looks pretty good, but showing it in a line plot would be much better.
{"y_test": y_test, "y_pred": y_pred_wfv}, index=y_test.indexThat looks much better! Now our predictions are actually tracking the test data, just like they did in the linear regression model.
Practice
Try it yourself! Access the parameters of model_prac, put y_prac_test and y_prac_pred_wfv into the test2_predictions DataFrame, and create a line plot using plotly express.
ARMA Models & Hyperparameters¶
ARMA stands for Auto Regressive Moving Average, and it's a special kind of time-series analysis. So far, we've used autoregression (AR) to build our time-series models, and you might recall that AR models rely on values that remain relatively stable over time. That is, they can predict the future very well, as long as the future looks roughly the same as the past. The trouble with predicting the future is that things can suddenly change, and as a result, the future doesn't look much like the past anymore. These sudden changes — economists call them exogenous shocks — can be as big as a hurricane destroying a city or an unexpected increase in the minimum wage, and they can be as small as a new restaurant opening in a neighborhood or a single person losing their job. In our data, the air quality might be changed if there was a nearby forest fire, or if a building collapsed near one of the sensors and raised a giant cloud of dust.
Regardless of the size of the shock, ARMA models can still predict the future. All we need to make that work is data.
Cleaning the Data¶
As always, we need to import all the tools we'll need to make our model.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
And then we need to get our database client up and running.
client = MongoClient(host="localhost", port=27017)
Then, we need to clean our data. All the examples will use data from Site 3; all the practice sets will use Site 2. If you need a refresher on how all those methods work, refer back to the Autoregression notebook.
df.index = df.index.tz_localize("UTC").tz_convert("Africa/Lagos")
Practice
Try it yourself! Get your client up and running and call your database db_prac. Create a variable called results_prac, and read in a collection called lagos_prac using data from Site 2. Save it as a Series called y_prac.
df_prac["P2.L1"] = ...Exploring the Data¶
Just like we did with AR, we'll start by exploring the data. Let's make a histogram.
y.hist();
Practice
Try it yourself! Make a histogram using y_prac.
This is what the data looks like when our sample is 1-hour intervals, but we might want to be able to quickly change our sample to other intervals of time. First, we'll create a function called wrangle, and then add an argument. In Python, arguments tell the function what to do. This function already has an argument called collection, so we'll need to add another to make resampling work. We'll call that argument resamp_pd.
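Here's a rough sketch of what the updated function might look like. The site number is a placeholder, and the outlier cutoff and forward-fill are carried over from the earlier wrangle function; adjust them to match your own notebook.
import pandas as pd

def wrangle(collection, resamp_pd="1H"):
    results = collection.find(
        {"metadata.site": 3, "metadata.measurement": "P2"},  # placeholder site number
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )
    df = pd.DataFrame(list(results)).set_index("timestamp")
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Lagos")
    df = df[df["P2"] < 500]  # remove outliers
    # Resample using whatever period the caller passes in
    y = df["P2"].resample(resamp_pd).mean().fillna(method="ffill")
    return y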
# Here's where the new argument goes. We're setting the default value to `"1H"`.
Now let's change "1H" to "1D" and see what happens.
y = wrangle(lagos, resamp_pd="1D")
As you can see on the left side of the table, the samples are now at one-day intervals, which is exactly what we wanted!
Let's make a new histogram to see if changing the sampling interval made a difference in the data.
y.hist();
This looks pretty different! It's always nice to have a diversified dataset.
Practice
Try it yourself! Define a function called wrangle_prac, run it, and print the results of y_prac. Then, create a new histogram from y_prac.
print(y_prac)
Like with our AR model, we need to create ACF and PACF plots to see what's happening with the correlation coefficients.
fig, ax = plt.subplots(figsize=(15, 6))
And now let's make a PACF plot.
fig, ax = plt.subplots(figsize=(15, 6))
Practice
Try it yourself! Make an ACF and a PACF plot using your y_prac data.
Splitting the Data¶
In our AR model, we split our data based on the number of observations we wanted to investigate. This time, we're going to split our data based on the date, using just the readings from October 2018. So, just like we did before, we'll create a training set using y, but instead of using percentages to split the data, we'll use dates.
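A minimal sketch of a date-based split, assuming y has a DatetimeIndex; with label-based slicing, both endpoints are included. (Using the first day of November as the test set is an assumption here; pick whatever period you want to predict.)
y_train = y.loc["2018-10-01":"2018-10-31"]  # every reading from October 2018
y_test = y.loc["2018-11-01"]                # the day that follows the training window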
# Notice that the date format is `YYYY-MM-DD`
Practice
Try it yourself! Create a training dataset called y_prac_train based on November 2017. (Hint: there are 30 days in November.)
y_prac_train = ...
Building the Model¶
Baseline¶
The first thing we need to do is calculate the MAE for our new model:
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
Practice
Try it yourself! Calculate the mean and MAE for the y_prac Series, and print the results.
y_prac_pred_baseline = ...
Iterating¶
So far, the only difference between our old AR model and the new ARMA model we're building is that the new model's data is based on the date rather than on the length of the variable. But the difference between AR and ARMA is the addition of hyperparameters.
Hyperparameters¶
Let's set our p values to include values from 0 to 25, moving in steps of 8:
p_params = range(0, 25, 8)
And let's set our q values to include values from 0 to 3, moving in steps of 1:
q_params = range(0, 3)
Practice
Using p_params_prac, set the p value to include values from 1 to 4, moving in steps of 1. Then, using q_params_prac, set the q value to include values from 0 to 3, moving in steps of 1.
p_params_prac = ...
In order to tell the model to keep going through all the possible combinations, we'll add in a pair of for loops. (If you need a refresher on for loops, refer to Notebook 001.)
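Here's a hedged sketch of that double loop, assuming the p_params, q_params, and y_train defined above. Because an ARMA(p, q) model is just an ARIMA(p, 0, q) model, we can fit it with statsmodels' ARIMA class:
from sklearn.metrics import mean_absolute_error
from statsmodels.tsa.arima.model import ARIMA

maes = {}
for p in p_params:
    maes[p] = []  # key is p, value is a list of MAEs across the q values
    for q in q_params:
        model = ARIMA(y_train, order=(p, 0, q)).fit()  # d=0, so this is ARMA(p, q)
        y_train_pred = model.predict().dropna()
        mae = mean_absolute_error(y_train.loc[y_train_pred.index], y_train_pred)
        maes[p].append(mae)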
# Here's where we tell the model how we want it to deal with time
Practice
Try it yourself! Create an ARMA model called model_prac2, storing the MAE for each combination of hyperparameters in a dictionary called maes_prac. Use your training and test data, print the results, and append each MAE to maes_prac.
Now that we have a working ARMA model, let's turn the output into a DataFrame so we can see what happened.
mae_grid = pd.DataFrame(maes)
And let's visualize the DataFrame in a heatmap. (If you need a refresher on how to create a heatmap in seaborn, refer to Notebook 008.)
plt.title("Grid Search (Criterion: MAE)");Practice
Try it yourself! Read the output of your ARMA model into a DataFrame called mae_grid_prac, and visualize it in a heatmap.
mae_grid_prac = ...
It looks like our MAE values are in the right place, but let's try some other ways to explore our new model using the plot_diagnostics method.
fig, ax = plt.subplots(figsize=(15, 12))
As usual, we have quite a lot to sift through here. The first graph is showing us our model's residuals. Ideally, we'd like to see this be as close to zero as possible, and this graph is telling us that, for the most part, we have a good model.
The next graph over shows us another version of the same thing. The histogram is similar to the one we made before, but there's a pair of lines superimposed. These lines show the kernel density estimate, which is a smoothed-out version of the blue histogram bars. The green line represents a normal distribution, and is included here just to give you something to compare to the values from your model. The orange line represents the smoothed-out version of the results of your model. Our model is actually pretty close to a normal distribution, so that's good!
The Q-Q plot on the bottom left is yet another way to visualize the same thing. Here, the red line shows where the points would fall if the residuals were perfectly normally distributed, and the wavy blue line shows what we actually have. Again, our model is pretty close to the red line, so it's looking good.
And finally, we have a correlogram, which might look familiar; it's the same kind of plot as the ACF and PACF plots from our AR models.
Practice
Try it yourself! Use plot_diagnostics to examine the residuals from model_prac.
Communicating the Results¶
Now that we have an ARMA model that seems to be working well, it's time to communicate the results of our analysis in a line graph. Let's create a graph that shows the relationship between our training and predicting data. (For a refresher on how to do this and what it means, refer to the AR notebook.)
{"y_train": y_train, "y_pred": y_train_pred}, index=y_train.indexPractice
Try it yourself! Create a line plot that compares the y_prac_train and y_prac_pred values.
ARCH and GARCH Models¶
ARCH stands for AutoRegressive Conditional Heteroskedasticity, and it models the variance of a time series. An ARCH model assumes that the variance at a given time depends on the squared shocks (errors) from previous time steps; GARCH (Generalized ARCH) extends this by letting the variance also depend on its own past values.
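As a rough sketch of the idea, a GARCH(1, 1) model writes the conditional variance as
$$\sigma_t^2 = \omega + \alpha\,\epsilon_{t-1}^2 + \beta\,\sigma_{t-1}^2$$
where $\epsilon_{t-1}$ is the previous period's shock (the ARCH part) and $\sigma_{t-1}^2$ is the previous period's variance (the GARCH part).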
YouTubeVideo("Li95a2biFCU")YouTubeVideo("inoBpq1UEn4")Let's see an example of using GARCH model in forecasting Apple stock price volatility. We first import the data.
df = pd.read_csv("./data/AAPL.csv", parse_dates=["date"], index_col="date")The dataset range from 1999 to 2022. For simplicity, we only take 5 years of data.
df = df["2015-10-13":]Calculating Returns¶
The next step is to calculate the daily returns of the closing prices. Here we can use the pct_change function from pandas to calculate the daily percentage change, then multiply the result by 100 to get the returns:
df["return"] = df["close"].pct_change() * 100We can then take out the return column as our training data. Note the first observation is missing since there is no past value to observe. We need to drop it:
AAPL_return = df["return"].dropna()We can check the histogram to see the distribution of the returns over the past five years:
plt.title("Distribution of AAPL Daily Returns");There's a negative outlier in this date range, with the idxmin function, we find out it was in 31 August 2020. The stock price has a huge drop due to a stock split.
AAPL_return.idxmin(), AAPL_return.min()
We can also check the standard deviation of the whole dataset with the pandas std() function:
AAPL_return.std()
To see these statistics more clearly, we can use some time series plots. In addition to the Apple stock price return plot, we can also plot the rolling volatility of the returns to smooth out the noise and see how returns and volatility are associated with each other. First, we calculate a 50-day rolling standard deviation series for the Apple stock price returns:
AAPL_return_rolling_50d_volatility = AAPL_return.rolling(window=50).std().dropna()
We can plot these two series:
ax=ax, label="50d rolling volatility", linewidth=3Here we can see that volatility goes up when the returns change drastically up or down. For instance, when return dropped enormously in 2020 August, volatility also increased a lot. However, this plot reveals a problem. We want to use returns to see if high volatility on one day is associated with high volatility on the following day. But high volatility is caused by large changes in returns, which can be either positive or negative. How can we assess negative and positive numbers together without them canceling each other out? The common solution used in building ARCH or GARCH models are to square the returns. Let's plot the squared returns:
(AAPL_return**2).plot(ax=ax, label="daily return")
Now it's much easier to see clusters of high and low volatility, and we can build a GARCH model. To do that, we need to choose the p and q parameters: p sets how many lagged squared shocks (the ARCH terms) the model uses, and q sets how many lagged variances (the GARCH terms) it uses. Both are expressed as lags. To see how many lags we should have in our model, we can create an ACF and a PACF plot — but using the squared returns.
fig, ax = plt.subplots(figsize=(15, 6))
fig, ax = plt.subplots(figsize=(15, 6))
Both the ACF and PACF graphs show that one lag is enough to build the model.
Building the Model¶
model = arch_model(AAPL_return, p=1, q=1, rescale=False).fit(disp=0)
Common Metrics¶
The model summary provides a lot of information about the model, like the trained parameters and measures of model performance. AIC and BIC are two important measures of performance. AIC stands for Akaike Information Criterion; it measures the goodness of fit of an estimated statistical model. BIC, the Bayesian Information Criterion, is used to select a model from a finite set of candidates. When fitting models, it's possible to improve the fit by adding more parameters, but that can also lead to overfitting. Both criteria address this by adding a penalty term for the number of parameters in the model, and the penalty is larger in BIC than in AIC. For the same model, BIC is usually higher than AIC, but in both cases the lower the value, the better the model.
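As a small sketch, assuming model is the fitted result from the arch_model(...).fit() call above, both criteria can be read straight off the results object:
print("AIC:", model.aic)
print("BIC:", model.bic)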
Standardized Residuals¶
After fitting the model to the data, we can also get the residuals to see how the model performs. The residual at time t is the difference between the observed return and the mean return the model predicts; dividing that residual by the model's estimated conditional volatility for the same time step gives the standardized residual.
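A small sketch of that definition, assuming model is the fitted GARCH result; this should match the std_resid attribute used below:
std_resid_manual = model.resid / model.conditional_volatility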
model.std_resid.plot(ax=ax, label="Standardized Residuals")
The standardized residuals hover around 0, except for the outlier events in August 2020. Let's check their normality with a histogram:
plt.title("Distribution of Standardized Residuals");If we exclude the outlier, the other standardized residuals do follow a normal distribution with mean 0.
Evaluation¶
We can further evaluate the model by comparing its forecasts with a subset of the observed returns to see whether it has successfully captured the volatility. First, let's plot the model's conditional volatility as a band of ±2 standard deviations around the observed returns across the whole dataset:
(-2 * model.conditional_volatility.rename("")).plot(ax=ax, color="C1")
We can then select the test set of the data and do a walk-forward validation on the GARCH model:
next_pred = model.forecast(horizon=1, reindex=False).variance.values[0][0] ** 0.5
We can plot the predicted volatility against the returns in this test set and see that the model has captured most of the variance:
(2 * y_test_wfv).plot(ax=ax, c="C1", label="2 SD Predicted Volatility")
Forecasting¶
The last step is to make the forecasts with our trained model. Let's make a one-day forecast first to see how GARCH model forecasting works:
one_day_forecast = model.forecast(horizon=1, reindex=False).variance
There are two things we need to keep in mind here. First, our model forecast shows the predicted variance, not the standard deviation / volatility, so we'll need to take the square root of the value. Second, the prediction is labeled "h.1", meaning one step after the last date in the training data; it doesn't have the DatetimeIndex we'd like. So we need to reformat the forecasts to be more readable.
We first generate 5 days predictions with our trained GARCH model:
prediction_dates = pd.bdate_range(start=start, periods=prediction.shape[1])
Then we define a function to clean the predictions to be more readable:
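Here's a rough sketch of such a helper; the function name clean_prediction is ours, and it assumes prediction is the variance DataFrame returned by model.forecast(horizon=5, reindex=False).variance:
import pandas as pd

def clean_prediction(prediction):
    # The forecast index holds the last training date; start one day later
    start = prediction.index[0] + pd.DateOffset(days=1)
    # Build a business-day index covering the forecast horizon
    prediction_dates = pd.bdate_range(start=start, periods=prediction.shape[1])
    prediction_index = [ts.isoformat() for ts in prediction_dates]
    # Convert predicted variance to volatility (standard deviation)
    data = prediction.values.flatten() ** 0.5
    return pd.Series(data, index=prediction_index)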
prediction_formatted = pd.Series(data, index=prediction_index)
Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
3.3. Autoregressive Models
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
from IPython.display import VimeoVideo
from pymongo import MongoClient
from sklearn.metrics import mean_absolute_error
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.ar_model import AutoReg

warnings.simplefilter(action="ignore", category=FutureWarning)
VimeoVideo("665851858", h="e39fc3d260", width=600)1. Prepare Data¶
1.1. Import¶
VimeoVideo("665851852", h="16aa0a56e6", width=600)Task 3.3.1: Complete to the create a client to connect to the MongoDB server, assigns the "air-quality" database to db, and assigned the "nairobi" connection to nairobi.
client = MongoClient(host="localhost", port=27017)
db = client["air-quality"]
nairobi = db["nairobi"]

VimeoVideo("665851840", h="e048425f49", width=600)
Task 3.3.2: Change the wrangle function below so that it returns a Series of the resampled data instead of a DataFrame.
def wrangle(collection):
    results = collection.find(
        {"metadata.site": 29, "metadata.measurement": "P2"},
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )

    # Read data into DataFrame
    df = pd.DataFrame(list(results)).set_index("timestamp")

    # Localize timezone
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")

    # Remove outliers
    df = df[df["P2"] < 500]

    # Resample to 1hr window
    y = df["P2"].resample("1H").mean().fillna(method="ffill")

    return y

Task 3.3.3: Use your wrangle function to read the data from the nairobi collection into the Series y.
y = wrangle(nairobi)y.head()timestamp 2018-09-01 03:00:00+03:00 17.541667 2018-09-01 04:00:00+03:00 15.800000 2018-09-01 05:00:00+03:00 11.420000 2018-09-01 06:00:00+03:00 11.614167 2018-09-01 07:00:00+03:00 17.665000 Freq: H, Name: P2, dtype: float64
# Check your workassert isinstance(y, pd.Series), f"`y` should be a Series, not type {type(y)}"assert len(y) == 2928, f"`y` should have 2928 observations, not {len(y)}"assert y.isnull().sum() == 01.2. Explore¶
VimeoVideo("665851830", h="85f58bc92b", width=600)#y.corr(y) #correlation between y it self is 1#y.corr(y.shift(1)) #correlation between y its one hour lag value is 0.65#y.corr(y.shift(2)) #correlation between y its one hour lag value is 0.4#y.corr(y.shift(3)) #correlation between y its one hour lag value is 0.3#we can find the same correlations using ACF plotTask 3.3.4: Create an ACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis as "Correlation Coefficient".
fig, ax = plt.subplots(figsize=(15, 6))plot_acf(y,ax=ax)plt.xlabel("Lag [hours]")plt.ylabel("Correlation Coefficient");xxxxxxxxxxVimeoVideo("665851811", h="ee3a2b5c24", width=600)Task 3.3.5: Create an PACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis as "Correlation Coefficient".
xxxxxxxxxxfig, ax = plt.subplots(figsize=(15, 6))plot_pacf(y,ax=ax)plt.xlabel("Lag [hours]")plt.ylabel("Correlation Coefficient");1.3. Split¶
xxxxxxxxxxVimeoVideo("665851798", h="6c191cd94c", width=600)Task 3.3.6: Split y into training and test sets. The first 95% of the data should be in your training set. The remaining 5% should be in the test set.
cutoff_test = int(len(y) * 0.95)
y_train = y.iloc[:cutoff_test]
y_test = y.iloc[cutoff_test:]

len(y_train) + len(y_test) == len(y)
True
2. Build Model¶
2.1. Baseline¶
Task 3.3.7: Calculate the baseline mean absolute error for your model.
xxxxxxxxxxy_train_mean = y_train.mean()y_pred_baseline = [y_train_mean] * len(y_train)mae_baseline = mean_absolute_error(y_train, y_pred_baseline)print("Mean P2 Reading:", round(y_train_mean, 2))print("Baseline MAE:", round(mae_baseline, 2))Mean P2 Reading: 9.22 Baseline MAE: 3.71
2.2. Iterate¶
xxxxxxxxxxVimeoVideo("665851769", h="94a4296cde", width=600)Task 3.3.8: Instantiate an AutoReg model and fit it to the training data y_train. Be sure to set the lags argument to 26.
model = AutoReg(y_train, lags=26).fit()

VimeoVideo("665851746", h="1a4511e883", width=600)
Task 3.3.9: Generate a list of training predictions for your model and use them to calculate your training mean absolute error.
xxxxxxxxxxy_pred = model.predict().dropna()training_mae = mean_absolute_error(y_train.iloc[26:],y_pred)print("Training MAE:", training_mae)Training MAE: 2.2809871656467036
xxxxxxxxxxVimeoVideo("665851744", h="60d053b455", width=600)Task 3.3.10: Use y_train and y_pred to calculate the residuals for your model.
xxxxxxxxxx#y_train_resid = y_train-y_pred #which is equal to or same results with below codey_train_resid = model.residy_train_resid.tail()timestamp 2018-12-25 19:00:00+03:00 -0.392002 2018-12-25 20:00:00+03:00 -1.573180 2018-12-25 21:00:00+03:00 -0.735747 2018-12-25 22:00:00+03:00 -2.022221 2018-12-25 23:00:00+03:00 -0.061916 Freq: H, dtype: float64
xxxxxxxxxxVimeoVideo("665851712", h="9ff0cdba9c", width=600)Task 3.3.11: Create a plot of y_train_resid.
xxxxxxxxxxfig, ax = plt.subplots(figsize=(15, 6))y_train_resid.plot(ylabel="Residual Value",ax=ax)<AxesSubplot:xlabel='timestamp', ylabel='Residual Value'>
xxxxxxxxxxVimeoVideo("665851702", h="b494adc297", width=600)Task 3.3.12: Create a histogram of y_train_resid.
xxxxxxxxxxy_train_resid.hist()plt.xlabel("Residual Value")plt.ylabel("Frequency")plt.title("AR(26),Distribution of Residuals")Text(0.5, 1.0, 'AR(26),Distribution of Residuals')
xxxxxxxxxxVimeoVideo("665851684", h="d6d782a1f3", width=600)Task 3.3.13: Create an ACF plot of y_train_resid.
xxxxxxxxxxfig, ax = plt.subplots(figsize=(15, 6))plot_acf(y_train_resid,ax=ax)2.3. Evaluate¶
xxxxxxxxxxVimeoVideo("665851662", h="72e767e121", width=600)xxxxxxxxxx# #method showed in video to work on test data and check mean_absolute_error# y_test.index.min()#to get the first value in test data using index# y_test.index.max()##to get the first value in test data using index# y_pred_test = model.predict(y_test.index.min(),y_test.index.max()).dropna()# test_mae = mean_absolute_error(y_test,y_pred_test)# print("Test MAE:", test_mae)xxxxxxxxxx#method used in training modelmodel = AutoReg(y_test,lags=26).fit()Task 3.3.14: Calculate the test mean absolute error for your model.
xxxxxxxxxxy_pred_test = model.predict().dropna()test_mae = mean_absolute_error(y_test.iloc[26:],y_pred_test)print("Test MAE:", test_mae)Test MAE: 1.1581601912057071
Task 3.3.15: Create a DataFrame test_predictions that has two columns: "y_test" and "y_pred". The first should contain the true values for your test set, and the second should contain your model's predictions. Be sure the index of test_predictions matches the index of y_test.
- Create a DataFrame from a dictionary using pandas.
xxxxxxxxxxdf_pred_test = pd.DataFrame( {"y_test": y_test, "y_pred": y_pred_test}, index=y_test.index)xxxxxxxxxxVimeoVideo("665851628", h="29b43e482e", width=600)Task 3.3.16: Create a time series plot for the values in test_predictions using plotly express. Be sure that the y-axis is properly labeled as "P2".
xxxxxxxxxxVimeoVideo("665851599", h="bb30d96e43", width=600)Task 3.3.17: Perform walk-forward validation for your model for the entire test set y_test. Store your model's predictions in the Series y_pred_wfv.
%%capture
y_pred_wfv = pd.Series()
history = y_train.copy()
for i in range(len(y_test)):
    model = AutoReg(history, lags=26).fit()
    next_pred = model.forecast()
    y_pred_wfv = y_pred_wfv.append(next_pred)
    history = history.append(y_test[next_pred.index])

len(y_pred_wfv) == len(y_test)
True
xxxxxxxxxxhistory.tail(1)2019-01-01 02:00:00+03:00 18.803333 Freq: H, Name: P2, dtype: float64
xxxxxxxxxxxxxxxxxxxxmodel = AutoReg(history,lags=26).fit()xxxxxxxxxxmodel.forecast()2019-01-01 03:00:00+03:00 14.110356 Freq: H, dtype: float64
xxxxxxxxxxy_test.head(1)timestamp 2018-12-26 00:00:00+03:00 5.679091 Freq: H, Name: P2, dtype: float64
xxxxxxxxxxxxxxxxxxxxVimeoVideo("665851568", h="a764ab5416", width=600)Task 3.3.18: Calculate the test mean absolute error for your model.
xxxxxxxxxxtest_mae = mean_absolute_error(y_test,y_pred_wfv)print("Test MAE (walk forward validation):", round(test_mae, 2))Test MAE (walk forward validation): 1.4
3. Communicate Results¶
xxxxxxxxxxVimeoVideo("665851553", h="46338036cc", width=600)Task 3.3.19: Print out the parameters for your trained model.
xxxxxxxxxx#unlike in scikit learn finding intercept and coefficients in statsmodels is very easy. below code will give those intercept and coefficient valuesprint(model.params)intercept 2.021763 P2.L1 0.587991 P2.L2 0.019749 P2.L3 0.023426 P2.L4 0.026807 P2.L5 0.044019 P2.L6 -0.101889 P2.L7 0.029814 P2.L8 0.049894 P2.L9 -0.017513 P2.L10 0.032585 P2.L11 0.064034 P2.L12 0.005978 P2.L13 0.018366 P2.L14 -0.007634 P2.L15 -0.016571 P2.L16 -0.015587 P2.L17 -0.035295 P2.L18 0.000929 P2.L19 -0.003956 P2.L20 -0.020668 P2.L21 -0.012730 P2.L22 0.052431 P2.L23 0.074405 P2.L24 -0.024263 P2.L25 0.090515 P2.L26 -0.088646 dtype: float64
xxxxxxxxxxVimeoVideo("665851529", h="39284d37ac", width=600)Task 3.3.20: Put the values for y_test and y_pred_wfv into the DataFrame df_pred_test (don't forget the index). Then plot df_pred_test using plotly express.
df_pred_test = pd.DataFrame({"y_test": y_test, "y_pred_wfv": y_pred_wfv})
fig = px.line(df_pred_test, labels={"Value": "PM2.5"})
fig.show()

Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
3.4. ARMA Models
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
VimeoVideo("665851728", h="95c59d2805", width=600)Prepare Data¶
Import¶
Task 3.4.1: Create a client to connect to the MongoDB server, then assign the "air-quality" database to db, and the "nairobi" collection to nairobi.
client = MongoClient(host="localhost",port=27017) df.index = df.index.tz_localize("UTC").tz_convert("Africa/Nairobi")VimeoVideo("665851670", h="3efc0c20d4", width=600)Task 3.4.2: Change your wrangle function so that it has a resample_rule argument that allows the user to change the resampling interval. The argument default should be "1H".
), f"Your function should take two arguments: `'collection'`, `'resample_rule'`. Your function takes the following arguments: {func_params}"Task 3.4.3: Use your wrangle function to read the data from the nairobi collection into the Series y.
y = wrangle(nairobi,resample_rule="1H")timestamp 2018-09-01 03:00:00+03:00 17.541667 2018-09-01 04:00:00+03:00 15.800000 2018-09-01 05:00:00+03:00 11.420000 2018-09-01 06:00:00+03:00 11.614167 2018-09-01 07:00:00+03:00 17.665000 Freq: H, Name: P2, dtype: float64
), f"There should be no null values in `y`. Your `y` has {y.isnull().sum()} null values."Explore¶
VimeoVideo("665851654", h="687ff8d5ee", width=600)Task 3.4.4: Create an ACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis as "Correlation Coefficient".
plt.ylabel("Correlation coefficient")Text(0, 0.5, 'Correlation coefficient')
VimeoVideo("665851644", h="e857f05bfb", width=600)Task 3.4.5: Create an PACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis as "Correlation Coefficient".
fig,ax = plt.subplots(figsize=(15,6))Text(0, 0.5, 'Correlation coefficient')
Split¶
Task 3.4.6: Create a training set y_train that contains only readings from October 2018, and a test set y_test that contains readings from November 1, 2018.
y_test = y.get("November 1, 2018")print(len(y_test))24
assert len(y_test) == 24, f"`y_test` should have 24 observations, not {len(y_test)}."
Build Model¶
Baseline¶
Task 3.4.7: Calculate the baseline mean absolute error for your model.
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)Mean P2 Reading: 10.12 Baseline MAE: 4.17
Iterate¶
VimeoVideo("665851576", h="36e2dc6269", width=600)Task 3.4.8: Create ranges for possible p_params should range between 0 and 25, by steps of 8. q_params should range between 0 and 3 by steps of 1.
p_params = range(0,25,8)[0, 8, 16, 24]
VimeoVideo("665851476", h="d60346ed30", width=600)Task 3.4.9: Complete the code below to train a model with every combination of hyperparameters in p_params and q_params. Every time the model is trained, the mean absolute error is calculated and then saved to a dictionary. If you're not sure where to start, do the code-along with Nicholas!
# Create key-value pair in dict. Key is `p`, value is empty list.Trained ARIMA (0, 0, 0) in 0.39 seconds.
Trained ARIMA (0, 0, 1) in 0.4 seconds.
Trained ARIMA (0, 0, 2) in 1.09 seconds.
Trained ARIMA (8, 0, 0) in 9.9 seconds.
Trained ARIMA (8, 0, 1) in 38.9 seconds.
Trained ARIMA (8, 0, 2) in 75.5 seconds.
Trained ARIMA (16, 0, 0) in 36.61 seconds.
Trained ARIMA (16, 0, 1) in 126.9 seconds.
Trained ARIMA (16, 0, 2) in 187.7 seconds.
Trained ARIMA (24, 0, 0) in 108.79 seconds.
Trained ARIMA (24, 0, 1) in 156.42 seconds.
Trained ARIMA (24, 0, 2) in 306.1 seconds.
{0: [4.171460443827197, 3.3506427433555537, 3.1057222587888647], 8: [2.9384480570404223, 2.9149008203151663, 2.908046094677133], 16: [2.9201084726122, 2.929436207635203, 2.913638894139686], 24: [2.914390327277196, 2.9136013252035013, 2.89792511429827]}
VimeoVideo("665851464", h="12f4080d0b", width=600)Task 3.4.10: Organize all the MAE's from above in a DataFrame names mae_df. Each row represents a possible value for
mae_df = pd.DataFrame(mae_grid)| 0 | 8 | 16 | 24 | |
|---|---|---|---|---|
| 0 | 4.1715 | 2.9384 | 2.9201 | 2.9144 |
| 1 | 3.3506 | 2.9149 | 2.9294 | 2.9136 |
| 2 | 3.1057 | 2.9080 | 2.9136 | 2.8979 |
VimeoVideo("665851453", h="dfd415bc08", width=600)Task 3.4.11: Create heatmap of the values in mae_grid. Be sure to label your x-axis "p values" and your y-axis "q values".
sns.heatmap(mae_df,cmap="Blues")Text(33.0, 0.5, 'q values')
VimeoVideo("665851444", h="8b58161f26", width=600)Task 3.4.12: Use the plot_diagnostics method to check the residuals for your model. Keep in mind that the plot will represent the residuals from the last model you trained, so make sure it was your best model, too!
fig, ax = plt.subplots(figsize=(15, 12))Evaluate¶
VimeoVideo("665851439", h="c48d80cdf4", width=600)Task 3.4.13: Complete the code below to perform walk-forward validation for your model for the entire test set y_test. Store your model's predictions in the Series y_pred_wfv. Choose the values for
history = history.append(y_test[next_pred.index])

print("Test MAE (walk forward validation):", round(test_mae, 2))
Communicate Results¶
VimeoVideo("665851423", h="8236ff348f", width=600)Task 3.4.14: First, generate the list of training predictions for your model. Next, create a DataFrame df_predictions with the true values y_test and your predictions y_pred_wfv (don't forget the index). Finally, plot df_predictions using plotly express. Make sure that the y-axis is labeled "P2".
fig = px.line(df_predictions, labels={"Value": "PM2.5"})
Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
Pandas: Advanced
Calculate Summary Statistics for a DataFrame or Series¶
Many datasets are large, and it can be helpful to get a summary of information in them. Let's load a dataset and examine the first few rows:
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")Let's get a summary description of this dataset.
mexico_city1.describe()Like most large datasets, this one has many values which are missing. The describe function will ignore missing values in each column. You can also remove rows and columns with missing values, and then get a summary of the data that's still there. We need to remove columns first, before removing the rows; the sequence of operations here is important. The code looks like this:
["floor", "price_usd_per_m2", "expenses", "rooms"], axis=1Let's take a look at our new, cleaner dataset.
mexico_city1.describe()Practice
Reload the mexico-city-real-estate-1.csv dataset. Reverse the sequence of operations by first dropping all rows where there is a missing value, and then dropping the columns, floor, price_usd_per_m2,expenses and rooms. What is the size of the resulting DataFrame?
mexico_city1 = pd.read_csv("./data/mexico-city-real-estate-1.csv")Select a Series from a DataFrame¶
Since the datasets we work with are so large, you might want to focus on a single column of a DataFrame. Let's load up the mexico-city-real-estate-2 dataset, and examine the first few rows to find the column names.
mexico_city2 = pd.read_csv("./data/mexico-city-real-estate-2.csv")Maybe we're interested in the surface_covered_in_m2 column. The code to extract just that one column looks like this:
surface_covered_in_m2 = mexico_city2["surface_covered_in_m2"]Practice
Select the price series from the mexico-city-real-estate-2 dataset, and load it into the mexico_city2 DataFrame
print(price)Subset a DataFrame by Selecting One or More Columns¶
You may find it more efficient to work with a smaller portion of a dataset that's relevant to you. One way to do this is to select some columns from a DataFrame and make a new DataFrame with them. Let's load a dataset to do this and examine the first few rows to find the column headings:
mexico_city4 = pd.read_csv("./data/mexico-city-real-estate-4.csv")Let's choose "operation", "property_type", "place_with_parent_names", and "price":
["operation", "property_type", "place_with_parent_names", "price"]Once we've done that, we can find the resulting number of entries in the DataFrame:
mexico_city4_subset.shapePractice
Load the mexico-city-real-estate-1.csv dataset and subset it to obtain the operation, lat-lon and place_with_property_names columns only. How many entries are in the resulting DataFrame?
["operation", "lat-lon", "place_with_parent_names"]Subset the Columns of a DataFrame Based on Data Types¶
It's helpful to be able to find specific types of entries — typically numeric ones — and put all of these in a separate DataFrame. First, let's take a look at the mexico-city-real-estate-5 dataset.
mexico_city5 = pd.read_csv("./data/mexico-city-real-estate-5.csv")The code to subset just the numerical entries looks like this:
mexico_city5_numbers = mexico_city5.select_dtypes(include="number")Practice
Create a subset of the DataFrame from mexico-city-real-estate-5 which excludes numbers.
print(mexico_city3_no_numbers.shape)Working with value_counts in a Series¶
In order to use the data in a series for other types of analysis, it might be helpful to know how often each value occurs in the Series. To do that, we use the value_counts method to aggregate the data. Let's take a look at the number of properties associated with each department in the colombia-real-estate-1 dataset.
df1 = pd.read_csv("data/colombia-real-estate-1.csv", usecols=["department"])Practice
Try it yourself! Aggregate the different property types in the colombia-real-estate-2 dataset.
df2 = ...Series and Groupby¶
Large Series often include data points that have some attribute in common, but which are nevertheless not grouped together in the dataset. Happily, pandas has a method that will bring these data points together into groups.
Let's take a look at the colombia-real-estate-1 dataset. The set includes properties scattered across Colombia, so it might be useful to group properties from the same department together; to do this, we'll use the groupby method. The code looks like this:
dept_group = df1.groupby("department")To make sure we got all the departments in the dataset, let's print the first occurrence of each department.
dept_group.first()Practice
Try it yourself! Group the properties in colombia-real-estate-2 by department, and print the result.
dept_group.first()Now that we have all the properties grouped by department, we might want to see the properties in just one of the departments. We can use the get_group method to do that. If we just wanted to see the properties in "Santander", for example, the code would look like this:
dept_group1 = df1.groupby("department")We can also make groups based on more than one category by adding them to the groupby method. After resetting the df1 DataFrame, here's what the code looks like if we want to group properties both by department and by property_type.
dept_group2 = df1.groupby(["department", "property_type"])Practice
Try it yourself! Group the properties in colombia-real-estate-2 by department and property type, and print the result.
dept_group.first()Finally, it's possible to use groupby to calculate aggregations. For example, if we wanted to find the average property area in each department, we would use the .mean() method. This is what the code for that looks like:
dept_group = df1.groupby("department")["area_m2"].mean()Practice
Try it yourself! Use the .mean method in the colombia-real-estate-2 dataset to find the average price in Colombian pesos ("price_cop") for properties in each "department".
dept_group = ...Subset a DataFrame's Columns Based on the Column Data Types¶
It's helpful to be able to find entries of a certain type, typically numerical entries, and put all of these in a separate DataFrame. Let's load a dataset to see how this works:
mexico_city5 = pd.read_csv("./data/mexico-city-real-estate-5.csv")Now let's get only numerical entries:
mexico_city5_numbers = mexico_city5.select_dtypes(include="number")Let's now find all entries which are not numerical entries:
mexico_city5_no_numbers = mexico_city5.select_dtypes(exclude="number")Practice
Create a subset of the DataFrame from mexico-city-real-estate-5.csv which excludes numbers. How many entries does it have?
print(mexico_city3_no_numbers.shape)Subset a DataFrame's columns based on column names¶
To subset a DataFrame by column names, either define a list of columns to include or define a list of columns to exclude. Once you've done that, you can retain or drop the columns accordingly. For example, let's suppose we want to modify the mexico_city3 dataset and only retain the first three columns. Let's define two lists, one with the columns to retain and one with the columns to drop:
keep_cols = ["operation", "property_type", "place_with_parent_names"]Next, we'll explore both approaches to subset mexico_city3. To retain columns based on keep_cols:
mexico_city3_subsetted = mexico_city3[keep_cols]To drop columns in drop_cols:
mexico_city3_subsetted = mexico_city3.drop(columns=drop_cols)Practice
Create a subset of the DataFrame from mexico-city-real-estate-3.csv which excludes the last two columns.
Pivot Tables¶
A pivot table allows us to aggregate and summarize a DataFrame across multiple variables. For example, let's suppose we wanted to calculate the mean of the price column in the mexico_city3 dataset for the different values in the property_type column:
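A hedged sketch of what that cell might contain, assuming mexico-city-real-estate-3.csv has the same price and property_type columns as the other Mexico City files:
import pandas as pd

mexico_city3 = pd.read_csv("./data/mexico-city-real-estate-3.csv")
mexico_city3_pivot = pd.pivot_table(
    mexico_city3, values="price", index="property_type", aggfunc="mean"
)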
mexico_city3_pivot = ...
Subsetting with Masks¶
Another way to create subsets from a larger dataset is through masking. Masks are ways to filter out the data you're not interested in so that you can focus on the data you are. For example, we might want to look at properties in Colombia that are bigger than 200 square meters. In order to create this subset, we'll need to use a mask.
First, we'll reset our df1 DataFrame so that we can draw on all the data in its original form. Then we'll create a statement and then assign the result to mask.
df1 = pd.read_csv("data/colombia-real-estate-1.csv")Notice that mask is a Series of Boolean values. Where properties are smaller than 200 square meters, our statement evaluates as False; where they're bigger than 200, it evaluates to True.
Once we have our mask, we can use it to select all the rows from df1 that evaluate as True.
df1[mask].head()Practice
Try it yourself! Read colombia-real-estate-2 into a DataFrame named df2, and create a mask to select all the properties that are smaller than 100 square meters.
df2[mask].head()We can also create masks with multiple criteria using & for "and" and | for "or." For example, here's how we would find all properties in Atlántico that are bigger than 400 square meters.
mask = (df1["area_m2"] > 400) & (df1["department"] == "Atlántico")Practice
Try it yourself! Create a mask for df2 to select all the properties in Tolima that are smaller than 150 square meters.
df2[mask].head()
Reshape a DataFrame based on column values¶
What's a pivot table?¶
A pivot table allows you to quickly aggregate and summarize a DataFrame using an aggregation function. For example, to build a pivot table that summarizes the mean of the price_cop column for each of the unique categories in the property_type column in df2:
pivot1 = pd.pivot_table(df2, values="price_cop", index="property_type", aggfunc=np.mean)Practice
Build a pivot table that summarizes the mean of the price_cop column for each of the unique departments in the department column in df2:
pivot2 = pd.pivot_table(df2, values="price_cop", index="department", aggfunc=np.mean)
Combine multiple categories in a Series¶
Categorical variables can be collapsed into a fewer number of categories. One approach is to retain the values of the most frequently observed values and collapse all remaining values into a single category. For example, to retain only the values of the top 10 most frequent categories in the department column and then collapse the other categories together, use value_counts to generate the count of the values:
df2["department"].value_counts()Next, select just the top 10 using the head() method, and select the column names by using the index attribute of the series:
top_10 = df2["department"].value_counts().head(10).indexFinally, use the apply method and a lambda function to select only the values from the department column and collapse the remaining values into the value Other:
df2["department"] = df2["department"].apply(lambda x: x if x in top_10 else "Other")Practice
Try it yourself! Retain the remaining top 5 most frequent values in the department column and collapse the remaining values into the category Other.
Cross Tabulation¶
The pandas crosstab function is useful for working with grouped summary statistics for categorical data. It starts by picking two categorical columns, then defines one as the index and the other as the column. If the aggregate function and value column are not defined, crosstab will simply calculate the frequency of each combination by default. Let's see the example below from the Colombia real estate dataset.
df = pd.read_csv("data/colombia-real-estate-1.csv")The following function calculates the frequency of each combination for two variables, department and property_type, in the data set.
pd.crosstab(index=df["department"], columns=df["property_type"])From the previous example, you can see we have created a DataFrame with the index showing unique observation in variable department, while the columns are the unique observation for the variable property_type. Each cell shows how many data points there are for each combination of department type and property_type. For example, there are 8 apartments in department Antioquia.
We can further specify a value column and an aggregate function, like in pivot_table, to conduct more complicated calculations for the two categorical variables. In the following example, we're looking at the average area size for different property types in the departments in Colombia.
columns=df["property_type"],Practice
Create a cross tabulation calculating the frequency of combinations from mexico-city-real-estate-3.csv, using currency as the index and property_type as the column.
# Create `crosstab`
Applying Functions to DataFrames and Series¶
apply is a useful method for applying one function to all the rows or columns of a DataFrame efficiently. Let's take the following real estate dataset as an example:
df = pd.read_csv("data/colombia-real-estate-2.csv", usecols=["area_m2", "price_cop"])By specifying the function inside apply(), we can transform the whole DataFrame. For example, I am calculating the square root of each row at each column:
import numpy as npNote you can also specify the "axis" inside apply. By default, we have axis=0, which means we are applying the function to each column. We can also switch to axis=1 if we want to apply the function to each row. See the following example showing the sum of all rows for each column:
df.apply(np.sum, axis=0)The following code will get the sum of all columns for each row:
df.apply(np.sum, axis=1)By specifying the column name, we can also apply the function to a specific column or columns. Note that we can also specify index (row names) to only apply functions to specific rows, however, it is not common in practice.
df["area_m2"].apply(np.sqrt)We can assign the results to a new column:
df["area_sqrt"] = df["area_m2"].apply(np.sqrt)Practice
Try it yourself! Create a new column named 'sum_columns', which is the sum of all numerical columns in df:
df["sum_columns"] = ...df.aggregate(), or df.agg(), shares the same concept as df.apply() in terms of applying functions to a DataFrame, but df.aggregate() can only apply aggregate functions like sum, mean, min, max, etc. See the following example for more details:
df = pd.read_csv("data/colombia-real-estate-2.csv", usecols=["area_m2", "price_cop"])
We can check the minimum value of each column by calling the min aggregate function:
df.agg("min")Like apply(), we can also specify the axis argument to switch axis:
df.agg("min", axis=1)We can apply aggregate function to a whole DataFrame using df.agg(), or specify the column name for a subset of DataFrame:
df["area_m2"].agg("min")We can also apply a list of aggregate functions to a DataFrame:
df.agg(["sum", "max"])The syntax above will calculate both sum and max for each column, and store the result as index. Besides, we can also apply different aggregate functions to different columns. In this case, we need to pass a dictionary specifying key as column names, and value as corresponding aggregate function names:
df.agg({"area_m2": "sum", "price_cop": "min"})Practice
Try it yourself! Find the minimum for column area_m2 and the maximum for column price_cop using df.agg():
Working with Dates¶
Time Stamps¶
Pandas' time series capabilities are built on the Timestamp class. For instance, we can transform date strings of different formats into Timestamp:
print(pd.Timestamp("January 8, 2022"))We can also do date math with Timestamps. Note that when we subtract two dates, we get a special object called a Timedelta.
time_delta = pd.Timestamp("Feb. 11 2016 2:30 am") - pd.Timestamp("2015-08-03 5:14 pm")
The pandas Timestamp class is also compatible with Python's datetime module.
pd.Timestamp("Feb 11, 2016") - datetime.datetime(2015, 8, 3)We can manipulate dates using some function from offset(). We can use the Day() function to add or decrease days from a timestamp. The following function calculates the date that's four days prior to 9 January 2017:
from pandas.tseries.offsets import BDay, BMonthEnd, Day
In some cases, you might need to add or subtract only business days. That's when you use BDay():
print(pd.Timestamp("January 9, 2017") - BDay(4))We can also do an offset at the monthly level. For example, you can use BMonthEnd(n) to find the last business day
print(pd.Timestamp("January 9, 2017") + BMonthEnd(0))The following function shows the last business day for May 2017, which is four months after February:
print(pd.Timestamp("February 9, 2017") + BMonthEnd(4))We can also use the strptime function inside datetime to transform string to date format:
date = datetime.strptime("16 Oct 2022 12:14:05", "%d %b %Y %H:%M:%S")We can further transform it into the ISO 8601 format:
date.isoformat()
Date Time Indices¶
DatetimeIndex is a convenient function that transforms date-like data into a readable Datetime format for a DataFrame index. That way, we can better plot and model time series data. Let's check an example. Here we have Apple's stock prices from 1999 to 2022.
data.set_index("date", inplace=True)Even though the index looks like dates, it is not in date format. So when we plot the data, the index doesn't follow the sequence of years, and the x-axis ticks are hard to read.
data["open"].plot()You can see the index is not in date format.
data.index[:5]We can use the DatetimeIndex function to transform it into date:
pd.DatetimeIndex(data.index)And we can assign this converted version back as the DataFrame's index:
data.index = pd.DatetimeIndex(data.index)Now the benefit is clear when we plot the data again:
data["open"].plot()Date Ranges¶
If we're entering time series data into a DataFrame, it will often be useful to create a range of dates using date_range. We can create it with different frequencies by specifying freq. Here are the days in a specific range:
pd.date_range(start="1/8/2022", end="3/2/2022", freq="D")Here are the business dates for a specific range:
pd.date_range(start="1/8/2022", end="3/2/2022", freq="B")References & Further Reading¶
- Pandas Documentation on Dropping a Column in a DataFrame
- Pandas Documentation on Printing out the First Few Rows of a DataFrame
- Pandas Documentation on Reading in a CSV File Into a DataFrame
- Getting Started with Pandas Documentation
- Pandas Documentation on Extracting a Column to a Series
- Pandas Official Indexing Guide
- Series in pandas
- Tutorial for value_counts
- Tutorial for groupby
- pandas Documentation for groupby
- Pandas Official Indexing Guide
- Online Examples of Selecting Numeric Columns of a DataFrame
- Official Pandas Documentation on Data Types in a DataFrame
- Pandas documentation for Boolean indexing
- More information on creating various kinds of subsets
- More on boolean indexing
- A tutorial on using various criteria to create subsets
- Pandas.DataFrame.apply
- Pandas.DataFrame.aggregate
Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
3.5. Air Quality in Dar es Salaam 🇹🇿
warnings.simplefilter(action="ignore", category=FutureWarning)from statsmodels.graphics.tsaplots import plot_acf, plot_pacfpp = PrettyPrinter(indent = 2)Prepare Data¶
Connect¶
Task 3.5.1: Connect to MongoDB server running at host "localhost" on port 27017. Then connect to the "air-quality" database and assign the collection for Dar es Salaam to the variable name dar.
client = MongoClient(host="localhost",port=27017)system.views lagos system.buckets.lagos nairobi system.buckets.nairobi dar-es-salaam system.buckets.dar-es-salaam
wqet_grader.grade("Project 3 Assessment", "Task 3.5.1", [dar.name])You got it. Dance party time! 🕺💃🕺💃
Score: 1
Explore¶
Task 3.5.2: Determine the numbers assigned to all the sensor sites in the Dar es Salaam collection. Your submission should be a list of integers.
sites = dar.distinct("metadata.site")[23, 11]
wqet_grader.grade("Project 3 Assessment", "Task 3.5.2", sites)Party time! 🎉🎉🎉
Score: 1
Task 3.5.3: Determine which site in the Dar es Salaam collection has the most sensor readings (of any type, not just PM2.5 readings). Your submission readings_per_site should be a list of dictionaries that follows this format:
[{'_id': 6, 'count': 70360}, {'_id': 29, 'count': 131852}]
Note that the values here ☝️ are from the Nairobi collection, so your values will look different.
{"$group":{"_id":"$metadata.site","count":{"$count":{}}}}[{'_id': 23, 'count': 60020}, {'_id': 11, 'count': 138412}]wqet_grader.grade("Project 3 Assessment", "Task 3.5.3", readings_per_site)Wow, you're making great progress.
Score: 1
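For reference, a minimal sketch of the aggregation that produces readings_per_site, built around the $group stage shown two cells above (the exact call here is an assumption, not the notebook's hidden code):
result = dar.aggregate(
    [{"$group": {"_id": "$metadata.site", "count": {"$count": {}}}}]
)
readings_per_site = list(result)  # list of {"_id": site, "count": n} dicts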
Import¶
Task 3.5.4: (5 points) Create a wrangle function that will extract the PM2.5 readings from the site that has the most total readings in the Dar es Salaam collection. Your function should do the following steps:
- Localize reading time stamps to the timezone for "Africa/Dar_es_Salaam".
- Remove all outlier PM2.5 readings that are above 100.
- Resample the data to provide the mean PM2.5 reading for each hour.
- Impute any missing values using the forward fill method.
- Return a Series y.
result = dar.find_one({}){ 'P1': 24.3,
'_id': ObjectId('6334b0ef8c51459f9b1bffb7'),
'metadata': { 'lat': -6.818,
'lon': 39.285,
'measurement': 'P1',
'sensor_id': 29,
'sensor_type': 'SDS011',
'site': 11},
'timestamp': datetime.datetime(2018, 1, 1, 0, 0, 4, 53000)}
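Putting the steps from the task together with the fields visible in the document above, a minimal wrangle sketch might look like the following. The query filter, the use of site 11 (the site with the most readings), and the measurement name "P2" are read off the outputs shown in this notebook; treat the exact code as an illustration rather than the official solution.
import pandas as pd

def wrangle(collection):
    # Query PM2.5 ("P2") readings from the site with the most readings (site 11)
    results = collection.find(
        {"metadata.site": 11, "metadata.measurement": "P2"},
        projection={"P2": 1, "timestamp": 1, "_id": 0},
    )
    df = pd.DataFrame(results).set_index("timestamp")

    # Localize timestamps to Dar es Salaam time
    df.index = df.index.tz_localize("UTC").tz_convert("Africa/Dar_es_Salaam")

    # Remove outlier PM2.5 readings above 100
    df = df[df["P2"] < 100]

    # Resample to hourly means and forward-fill missing values
    y = df["P2"].resample("1H").mean().fillna(method="ffill")

    return y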
df.index = df.index.tz_localize("UTC").tz_convert("Africa/Dar_es_Salaam")Use your wrangle function to query the dar collection and return your cleaned results.
y = wrangle(dar)timestamp 2018-01-01 03:00:00+03:00 9.456327 2018-01-01 04:00:00+03:00 9.400833 2018-01-01 05:00:00+03:00 9.331458 2018-01-01 06:00:00+03:00 9.528776 2018-01-01 07:00:00+03:00 8.861250 Freq: H, Name: P2, dtype: float64
wqet_grader.grade("Project 3 Assessment", "Task 3.5.4", wrangle(dar))Yup. You got it.
Score: 1
Explore Some More¶
Task 3.5.5: Create a time series plot of the readings in y. Label your x-axis "Date" and your y-axis "PM2.5 Level". Use the title "Dar es Salaam PM2.5 Levels".
y.plot(xlabel="Date",ylabel="PM2.5 Level",title="Dar es Salaam PM2.5 Levels",ax=ax) wqet_grader.grade("Project 3 Assessment", "Task 3.5.5", file)Very impressive.
Score: 1
Task 3.5.6: Plot the rolling average of the readings in y. Use a window size of 168 (the number of hours in a week). Label your x-axis "Date" and your y-axis "PM2.5 Level". Use the title "Dar es Salaam PM2.5 Levels, 7-Day Rolling Average".
y.rolling(168).mean().plot(xlabel = "Date",ylabel="PM2.5 Level",ax=ax); wqet_grader.grade("Project 3 Assessment", "Task 3.5.6", file)Awesome work.
Score: 1
Task 3.5.7: Create an ACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis as "Correlation Coefficient". Use the title "Dar es Salaam PM2.5 Readings, ACF".
plt.title("Dar es Salaam PM2.6 Readings, ACF") wqet_grader.grade("Project 3 Assessment", "Task 3.5.7", file)Boom! You got it.
Score: 1
Task 3.5.8: Create a PACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis as "Correlation Coefficient". Use the title "Dar es Salaam PM2.5 Readings, PACF".
plt.title("Dar es Salaam PM2.5 Readings, PACF") wqet_grader.grade("Project 3 Assessment", "Task 3.5.8", file)Boom! You got it.
Score: 1
Split¶
Task 3.5.9: Split y into training and test sets. The first 90% of the data should be in your training set. The remaining 10% should be in the test set.
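A minimal split sketch, assuming y is the Series returned by wrangle:
cutoff_test = int(len(y) * 0.90)   # index marking the end of the first 90% of observations
y_train = y.iloc[:cutoff_test]
y_test = y.iloc[cutoff_test:]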
print("y_train shape:", y_train.shape)y_train shape: (1533,) y_test shape: (171,)
wqet_grader.grade("Project 3 Assessment", "Task 3.5.9a", y_train)Yes! Keep on rockin'. 🎸That's right.
Score: 1
wqet_grader.grade("Project 3 Assessment", "Task 3.5.9b", y_test)🥳
Score: 1
Build Model¶
Baseline¶
Task 3.5.10: Establish the baseline mean absolute error for your model.
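One common way to build this baseline (a sketch; mean_absolute_error comes from scikit-learn) is to predict the training mean for every observation:
from sklearn.metrics import mean_absolute_error

y_train_mean = y_train.mean()
y_pred_baseline = [y_train_mean] * len(y_train)   # naive constant prediction
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)

print("Mean P2 Reading:", y_train_mean)
print("Baseline MAE:", mae_baseline)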
mae_baseline = mean_absolute_error(y_train,y_pred_baseline)Mean P2 Reading: 8.617582545265428 Baseline MAE: 4.07658759405218
wqet_grader.grade("Project 3 Assessment", "Task 3.5.10", [mae_baseline])Correct.
Score: 1
Iterate¶
Task 3.5.11: You're going to use an AR model to predict PM2.5 readings, but which hyperparameter settings will give you the best performance? Use a for loop to train your AR model with settings for p from 1 to 30. Each time you train a new model, calculate its mean absolute error and append the result to the list maes. Then store your results in the Series mae_series.
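A sketch of that loop, using statsmodels' AutoReg and its in-sample predictions (the dropna call discards the first p observations, which have no prediction):
from statsmodels.tsa.ar_model import AutoReg

p_params = range(1, 31)
maes = []
for p in p_params:
    # Train an AR(p) model and score it on the training data
    model = AutoReg(y_train, lags=p).fit()
    y_pred = model.predict().dropna()
    mae = mean_absolute_error(y_train.iloc[p:], y_pred)
    maes.append(mae)

mae_series = pd.Series(maes, name="mae", index=p_params)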
mae = mean_absolute_error(y_train.iloc[p:],y_pred)1 0.947888 2 0.933894 3 0.920850 4 0.920153 5 0.919519 Name: mae, dtype: float64
wqet_grader.grade("Project 3 Assessment", "Task 3.5.11", mae_series)Awesome work.
Score: 1
Task 3.5.12: Look through the results in mae_series and determine what value for p provides the best performance. Then build and train final_model using the best hyperparameter value.
Note: Make sure that you build and train your model in one line of code, and that the data type of best_model is statsmodels.tsa.ar_model.AutoRegResultsWrapper.
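A minimal sketch consistent with the note (not necessarily the notebook's exact code), assuming mae_series from the previous task:
best_p = mae_series.idxmin()                       # lag with the lowest training MAE
best_model = AutoReg(y_train, lags=best_p).fit()   # build and train in one line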
best_model = AutoReg(y_train, lags=best_p).fit()
wqet_grader.grade("Project 3 Assessment", "Task 3.5.12", [isinstance(best_model.model, AutoReg)])Python master 😁
Score: 1
Task 3.5.13: Calculate the training residuals for best_model and assign the result to y_train_resid. Note that the name of your Series should be "residuals".
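One way to get the residuals (a sketch; fitted AutoReg results expose them directly through .resid):
y_train_resid = best_model.resid      # observed minus fitted values
y_train_resid.name = "residuals"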
y_train_resid.name = "residuals"timestamp 2018-01-02 09:00:00+03:00 -0.560158 2018-01-02 10:00:00+03:00 -2.193105 2018-01-02 11:00:00+03:00 -0.026408 2018-01-02 12:00:00+03:00 0.820217 2018-01-02 13:00:00+03:00 -0.078009 Freq: H, Name: residuals, dtype: float64
wqet_grader.grade("Project 3 Assessment", "Task 3.5.13", y_train_resid.tail(1500))Party time! 🎉🎉🎉
Score: 1
Task 3.5.14: Create a histogram of y_train_resid. Be sure to label the x-axis as "Residuals" and the y-axis as "Frequency". Use the title "Best Model, Training Residuals".
plt.title("Best Model, Training Residuals") wqet_grader.grade("Project 3 Assessment", "Task 3.5.14", file)Boom! You got it.
Score: 1
Task 3.5.15: Create an ACF plot for y_train_resid. Be sure to label the x-axis as "Lag [hours]" and y-axis as "Correlation Coefficient". Use the title "Dar es Salaam, Training Residuals ACF".
plt.title("Dar es Salaam, Training Residuals ACF") wqet_grader.grade("Project 3 Assessment", "Task 3.5.15", file)Very impressive.
Score: 1
Evaluate¶
Task 3.5.16: Perform walk-forward validation for your model for the entire test set y_test. Store your model's predictions in the Series y_pred_wfv. Make sure the name of your Series is "prediction" and the name of your Series index is "timestamp".
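A sketch of the loop, consistent with the history.append line shown in the next cell and assuming best_p and AutoReg from the previous tasks: refit on the growing history, forecast one hour ahead, then add the true observation to the history. Series.append works in the pandas version used by this notebook; in newer pandas you would use pd.concat instead.
y_pred_wfv = pd.Series(dtype="float64")
history = y_train.copy()
for i in range(len(y_test)):
    model = AutoReg(history, lags=best_p).fit()
    next_pred = model.forecast()                        # one-step-ahead forecast
    y_pred_wfv = y_pred_wfv.append(next_pred)
    history = history.append(y_test[next_pred.index])   # walk forward one hour

y_pred_wfv.name = "prediction"
y_pred_wfv.index.name = "timestamp"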
history = history.append(y_test[next_pred.index])timestamp 2018-03-06 00:00:00+03:00 8.049707 2018-03-06 01:00:00+03:00 8.676565 2018-03-06 02:00:00+03:00 6.271593 2018-03-06 03:00:00+03:00 6.319877 2018-03-06 04:00:00+03:00 7.163557 Freq: H, Name: prediction, dtype: float64
wqet_grader.grade("Project 3 Assessment", "Task 3.5.16", y_pred_wfv)That's the right answer. Keep it up!
Score: 1
Task 3.5.17: Submit your walk-forward validation predictions to the grader to see test mean absolute error for your model.
wqet_grader.grade("Project 3 Assessment", "Task 3.5.17", y_pred_wfv)--------------------------------------------------------------------------- Exception Traceback (most recent call last) Cell In [40], line 1 ----> 1 wqet_grader.grade("Project 3 Assessment", "Task 3.5.17", y_pred_wfv) File /opt/conda/lib/python3.9/site-packages/wqet_grader/__init__.py:182, in grade(assessment_id, question_id, submission) 177 def grade(assessment_id, question_id, submission): 178 submission_object = { 179 'type': 'simple', 180 'argument': [submission] 181 } --> 182 return show_score(grade_submission(assessment_id, question_id, submission_object)) File /opt/conda/lib/python3.9/site-packages/wqet_grader/transport.py:146, in grade_submission(assessment_id, question_id, submission_object) 144 raise Exception('Grader raised error: {}'.format(error['message'])) 145 else: --> 146 raise Exception('Could not grade submission: {}'.format(error['message'])) 147 result = envelope['data']['result'] 149 # Used only in testing Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET submission API: You have already passed this course!
Communicate Results¶
Task 3.5.18: Put the values for y_test and y_pred_wfv into the DataFrame df_pred_test (don't forget the index). Then plot df_pred_test using plotly express. Be sure to label the x-axis as "Date" and the y-axis as "PM2.5 Level". Use the title "Dar es Salaam, WFV Predictions".
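A sketch using plotly express, assuming the shared index of y_test and y_pred_wfv is named "timestamp":
import plotly.express as px

df_pred_test = pd.DataFrame(
    {"y_test": y_test, "y_pred_wfv": y_pred_wfv}, index=y_test.index
)
fig = px.line(
    df_pred_test,
    labels={"timestamp": "Date", "value": "PM2.5 Level"},  # relabel index and values
    title="Dar es Salaam, WFV Predictions",
)
fig.show()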
fig.write_image("images/3-5-18.png", scale=1, height=500, width=700) wqet_grader.grade("Project 3 Assessment", "Task 3.5.18", file)--------------------------------------------------------------------------- Exception Traceback (most recent call last) Cell In [43], line 2 1 with open("images/3-5-18.png", "rb") as file: ----> 2 wqet_grader.grade("Project 3 Assessment", "Task 3.5.18", file) File /opt/conda/lib/python3.9/site-packages/wqet_grader/__init__.py:182, in grade(assessment_id, question_id, submission) 177 def grade(assessment_id, question_id, submission): 178 submission_object = { 179 'type': 'simple', 180 'argument': [submission] 181 } --> 182 return show_score(grade_submission(assessment_id, question_id, submission_object)) File /opt/conda/lib/python3.9/site-packages/wqet_grader/transport.py:146, in grade_submission(assessment_id, question_id, submission_object) 144 raise Exception('Grader raised error: {}'.format(error['message'])) 145 else: --> 146 raise Exception('Could not grade submission: {}'.format(error['message'])) 147 result = envelope['data']['result'] 149 # Used only in testing Exception: Could not grade submission: Could not verify access to this assessment: Received error from WQET submission API: You have already passed this course!
Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.
Time Series: Core Concepts
from IPython.display import YouTubeVideo
Model Types¶
Autoregression Models¶
Autoregression (AR) is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. AR works in a similar way to autocorrelation: in both cases, we're comparing one part of a series to another part of the same series. An AR model regresses a series on its own past values.
ARMA Models¶
ARMA stands for Auto Regressive Moving Average, and it's a kind of time-series model. So far, we've used autoregression (AR) to build our time-series models, and you might recall that AR models rely on patterns that remain relatively stable over time. That is, they can predict the future very well, as long as the future looks roughly the same as the past. The trouble with predicting the future is that things can suddenly change, and as a result, the future doesn't look much like the past anymore. These sudden changes (economists call them exogenous shocks) can be as big as a hurricane destroying a city or an unexpected increase in the minimum wage, and they can be as small as a new restaurant opening in a neighborhood or a single person losing their job. In our data, the air quality might change if there was a nearby forest fire, or if a building collapsed near one of the sensors and raised a giant cloud of dust.
Below is a video from Ritvik Kharkar that explains MA models in terms of cupcakes and crazy professors — two things we love!
YouTubeVideo("voryLhxiPzE")And in this video, Ritvik talks about the ARMA model we use in Project 3.
YouTubeVideo("HhvTlaN06AM")Plots¶
ACF Plot¶
When we've worked with autocorrelations in the past, we've treated them as single, static numbers, but that's not the whole story. Often we want to see how the autocorrelation changes as the lag grows, which means thinking of it as a function of the lag. When we create a visual representation of that autocorrelation function (ACF), we're making an ACF plot.
PACF Plot¶
An autocorrelation at a given lag mixes two kinds of influence. The direct effect is the relationship between an observation and the one a fixed number of steps before it. The indirect effect is whatever relationship flows through the observations in between; for a lag of three hours, that means the influence passed along through the readings one and two hours back. Those indirect effects can inflate the correlation, so it's useful to strip them out and see what remains when only the direct effect is counted.
An autocorrelation that keeps only that direct effect is called a partial autocorrelation, and when we view the partial autocorrelation as a function of the lag, we call it a PACF.
PACF plots represent those partial autocorrelations visually. We compare our ACF and PACF plots to see which model best describes our time series: if the ACF tails off gradually while the PACF cuts off sharply after a few lags, an AR model is a good description; if the PACF tails off gradually while the ACF cuts off sharply, an MA model is a good description.
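A sketch of producing both plots for a series y with statsmodels (axis labels and titles would be added the same way as in the project tasks above):
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, (ax_acf, ax_pacf) = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(y, ax=ax_acf)    # full autocorrelation function
plot_pacf(y, ax=ax_pacf)  # partial autocorrelation function
plt.show()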
Statistical Concepts¶
Walk-Forward Validation¶
Our predictions lose power over time because the model gets farther and farther away from its starting point. But what if we could move that starting point forward with the model? That's the idea behind walk-forward validation. In walk-forward validation, we re-train the model for each new observation in the dataset, dropping the data that's farthest in the past. Let's say that our prediction for what's going to happen at 12:00 is based on what happened at 11:00, 10:00, and 9:00. When we move forward an hour to predict what's going to happen at 1:00, we only use data from 10:00, 11:00, and 12:00, dropping the data from 9:00 because it's now too far in the past.
Parameters¶
Parameters are the parts of the model that are learned from the training data.
Hyperparameters¶
We've already seen that parameters are elements that a machine learning model learns from the training data. Hyperparameters, on the other hand, are elements of the model that come from somewhere else. Data scientists choose hyperparameters either by examining the data themselves, or by creating some kind of automated testing of different options to see how they perform. Hyperparameters are set before the model is trained, which means that they significantly impact how the model is trained, and how it subsequently performs. One way of thinking about the difference between the two is that parameters come from inside the model, and hyperparameters come from outside the model.
When we think about the hyperparameters of an ARMA model, we think in terms of p and q. The value of p is the number of lagged observations included in the autoregressive part of the model, and q is the size of the moving average window. These values count as hyperparameters because we get to decide what they are: how many lagged observations do we want to include, and how big should our window of moving averages be?
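In statsmodels, for example, p and q are passed to the model through the order tuple; the values below are purely illustrative, not tuned:
from statsmodels.tsa.arima.model import ARIMA

# order=(p, d, q); with d=0 this is an ARMA(p, q) model
model = ARIMA(y_train, order=(8, 0, 1)).fit()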
Rolling Averages¶
A rolling average is the mean of successive, overlapping subsets of a dataset. For example, I might have data on the daily income for a shop I own, and as long as the shop stays open, I can keep calculating a rolling average. On Friday, I might calculate the average income from Monday through Thursday. The next Monday, I might calculate the average income from Tuesday through Friday; the day after that, the average from Wednesday through Monday, and so on. These averages roll forward, giving me a sense of how the data is changing over time rather than against a single fixed benchmark. In this case, and in many data science applications, the dimension we roll over is time. Calculating rolling averages smooths out noise and helps us make more accurate forecasts about how the data will change in the future.
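A small pandas sketch with hypothetical numbers:
import pandas as pd

daily_income = pd.Series([120, 95, 130, 110, 150, 140, 160, 125])  # hypothetical daily shop income
weekly_rolling = daily_income.rolling(window=4).mean()             # mean of each 4-day window
print(weekly_rolling)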
Copyright 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.